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Abstract 

This report presents an Expectation-Maximization (EM) algorithm for estimation of the maximum- 
likelihood parameter values of constrained multivariate autoregressive Gaussian state-space (MARSS) 
models. The MARSS model can be written: x(t)=Bx(t-l)+u+w(t), y(t)=Zx(t)+a+v(t), where w(t) 
and v(t) are multivariate normal error-terms with variance-covariance matrices Q and R respectively. 
MARSS models are a class of dynamic linear model and vector autoregressive model state-space model. 
Shumway and Stoffer presented an unconstrained EM algorithm for this class of models in 1982, and a 
number of researchers have presented EM algorithms for specific types of constrained MARSS models 
since then. In this report, I present a general EM algorithm for constrained MARSS models, where 
the constraints are on the elements within the paramater matrices (B,u,Q,Z,a,R). The constraints take 
the form vec(M)=f+Dm, where M is the parameter matrix, f is a column vector of fixed values, D is 
a matrix of multipliers, and m is the column vector of estimated values. This allows a wide variety of 
constrained parameter matrix forms. The presentation is for a time-varying MARSS model, where time- 
variation enters through the fixed (meaning not estimated) f(t) and D(t) matrices for each parameter. 
The algorithm allows missing values in y and partially deterministic systems where 0s appear on the 
diagonals of Q or R. 
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1 Overview 



EM algorithms extend maximum-likelihood estimation to models with hidden states and are widely used in 
engineering and computer science applications. This report presents an EM algorithm for a general class of 
Gaussian constrained multivariate autoregressive state-space (MARSS) models, with a hidden multivariate 
autoregressive process (state) model and a multivariate observation model. Thi s is an important class of 
time-s eries model used in many different scientific fields. The reader is r eferred to McLachlan and Krishnanl 
(2008) for general b ackground on E M algorithms and to Harvey ( 19891 ) for a discussion of EM algorithms 
for time-series data. Borman (2009) has a nice tutorial on the EM algorithm. 

Before showing the derivation for the constrained case, I first show a derivation of the EM algo rithm for 
unconstrained^ MARSS model. This EM algorith m was published by Shumwav and Stoffer (Il982l). but my 
deriv ation is more similar to Ghahramani et al's ( Ghahramani and Hinton . 19961 : Roweis and Ghahramani . 
1999) slightly different presentation. One difference in my presentation and all these previous presentations, 
however, is that I treat the data as a random variable throughout; this means that there are no "special" up- 
date equations for the missing values case. Another difference is that I present the update equations for both 
stochastic initial states and fixed initial states. I then extend the derivation to constrained MARSS models 
where there are fixed and shared elements in the parameter matrices and to the case of de generate MARSS 
mo dels where some pr ocesses in the model are deterministic rather than stochastic. See also lWu et al. (1996) 
and IZuur et al.l (|2003r ) for other examples of the EM algorithm for different classes of constrained MARSS 
models. 

When working with MARSS models, one should be cognizant that misspecification of the prior on the 
initial hidden states can have catastrophic and difficult to detect effects on the parameter estimates. There is 
often no sign that something is amiss with the MLE estimates output by an EM algorithm. There has been 
much work on how to avoid these initial conditions effects; see especially literature on vector autoregressive 
state-space models in the economics literature. The trouble often occurs when the prior on the initial states 
is inconsistent with the distribution of the initial states that is implied by the maximum-likelihood model. 
This often happens when the model implies a specific covariance structure on the initial states, but since 
the maximum-likelihood parameters are unknown, this covariance structure is unknown. Using a diffuse 
prior docs not help since your diffuse prior still has some covariance structure (often independence is being 
imposed). In some ways the EM algorithm is less sensitive to a mis-specified prior because it uses the 
smoothed states conditioned on all the data. However, if the prior is inconsistent with the model, the EM 
algorithm will not (cannot) find the MLEs. It is very possible however that it will find parameter estimates 
that are closer to what you intend (estimates uninfluenced by the prior), but they will not be MLEs. The 
derivation presented here allows one to circumvent these problems by treating the initial states as fixed (and 
estimated) parameters. The problematic initial state variance-covariance matrix is removed from the model, 
albeit at the cost of additional estimated parameters. 

Finally, when working with MARSS models, one needs to ensure that the model is identifiable, i.e. a 
unique solution exists. For a given MARSS model, some of the parameter elements will need to be fixed (not 
estimated) in order to produce a model with one solution. How to do that depends on the MARSS model 
being fitted and is up to the user. 



1.1 The MARSS model 

The linear MARSS model with a stochastic initial stated is 



x t = Bx f _i + u + w t , where W t - MVN(0, Q) 
y t = Zx t + a + v t , where V t ~ MVN(0, R) 
X ~ MVN(£, A) 



(la) 
(lb) 
(1c) 



1 "unconstrained" means that each element in the parameter matrix is estimated and no elements are fixed or shared. 

2l Stochastic' means the initial state has a distribution rather than a fixed value. Because the process must start somewhere, 
one needs to specify the initial state. In equation[T] I show the initial state specified as a distribution. However, the derivation 
will also discuss the case where the initial state is specified as an unknown fixed parameter. 
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The y equation is called the observation process, and y t is a n x 1 vector. The x equation is called the state 
or process equation, and Xt is a m x 1 vector. The equation for x describes a multivariate autoregressive 
process (also called a random walk or Markov process) . w are the process errors and are specific realizations 
of the random variable W; v is defined similarly. The initial state can either defined at t — 0, as is done 
in equation [TJ or at t = 1. When presenting the MARSS model, I use t = but the derivations will show 
the EM algorithm for both cases. Q and R are variance-covariance matrices that specify the stochasticity 
in the observation and state equations. 

In the MARSS model, x and y equations describe two stochastic processes. By tradition, one conditions 
on observations of y, and x is treated as completely hidden, hence the name 'hidden Markov process' of which 
a MARSS model is a special type. However, you could condition on (partial) observations of x and treat y as 
a (partially) hidden process — with as usual proper constraints to ensure identifiability. Nonetheless in this 
report, I follow tradition and treat x as hidden and y as (partially) observed. If x is partially observed then 
the update equations stay the same but the expectations shown in section [6] would be computed conditioned 
on the partially observed x. 

The first part of this report will review the derivation of an EM algorithm for the time-constant MARSS 
model (equation [TJ. However the main objective of this report is to show the derivation of an EM algorithm 
to solve a much more general MARSS model (section 2]) , which is a MARSS model with linear constraints 
on time- varying parameters: 

x t = B t xt i +u t + G t w t , where W t - MVN(0, Q t ) 
y t = Z t x t + a t + H t v t , where V t ~ MVN(0, R t ) (2) 
x ta = £ + Fl, where 1 - MVN(0, A) 

The linear constraints appear as the vectorization of each parameter (B, u, Q, Z, a, R, £, A) is described 
by the relation f t + D t m. This relation specifies linear constraints of the form ft + P a ,i a + ft,i& + • ■ • on 
the elements in each MARSS parameter matrix. Equation @ is a much broader class of MARSS models 
that includes MARSS models with exogenous variable (covariates), AR-p models, moving average models, 
constrained MARSS models and models that are combinations of these. The derivation also includes partially 
deterministic systems where G t , H t and F may have all zero rows. 



1.2 The joint log-likelihood function 

Denote the set of all y's and x's from t = 1 to T by y and x. The joint log-likelihoocjf] of y and x can then 
be written then as followtQ, where X t denotes the random variable and Xt is a realization from that random 
variable (and similarly for Kt)H 

f(y,x) = f(y\X=x)f(x), (3) 

where 

T 

f(x)^f(x )l[f(x t \X t 1 - 1 =x[- 1 ) 

T ^ (4) 
f(y\X = x) = f[f(y t \X = x) 



3 This is not the log likelihood output by the Kalman filter. The log likelihood output by the Kalman filter is the log L(i/; 0) 
(notice x does not appear), which is known as the marginal log likelihood. 

4 The log-likelihood function is shown here for the MARSS with non-time varying parameters (equation [TJ . 

5 To alleviate clutter, I have left off subscripts on the f's. To emphasize that the /'s represent different density functions, one 
would often use a subscript showing what parameters are in the functions, i.e. f(xt\Xt-\ = Xt— l) becomes fs,u,Q {xt \Xt — i = 
Xt-i). 
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Thus, 

T T 

f(y,x) = J] f(Vt\X = *) x /(*o) II A**!**" 1 = ^i" 1 ) 

t= \ (5) 
= fj f(y t \X t = xt) x /(*„) f[ /(x t |X t _! = 



Here x^f denotes the set of Xt from i = il to t = t2 (and thus x is shorthand for xj). The third line follows 
because conditioned on x, the t/ t 's are independent of each other (because the v t are independent of each 
other). In the last line, x' _1 becomes x t -i from the Markov property of the equation for x t (equation [Ta|) . 
and x becomes x t because y t depends only on x t (equation [Tbf . 

Since (X t \X t -i — JCt-i) is multivariate normal and (Y t \X t = Xt) is multivariate normal (equation [1} , 
we can write down the joint log- likelihoo d function using the likelihood function for a multivariate normal 



distribution (jJohnson and Wichernl . 120071 sec. 4.3) 



T T 

logL(y,x; 9) = \{y t - Zx t - a) T R" 1 (y t - Zx t - a) \ log |R| 



T 



-{x t - Bx t _! - u) T Q- 1 (x t - Bx t ! - u) - \ log |Q| (6) 



^ 2 

l 

i(«o - l) T A _1 (»o - - ^ log |A| - | log 2^ 



n is the number of data points. This is the same as equation 6.64 in lShumwav and Stoffer (120061) . The above 
equation is for the case where Xo is stochastic (has a known distribution). However, if we instead treat Xo as 
fixed but unknown (section 3.4.4 in Harvey, 1989), it is then a parameter and there is no A. The likelihood 
then is slightly different. Xq is defined as a parameter £ and 

T T 

logL(y,x; 9) = l(y t - Zx t - a) T R- 1 (y t - Zx t - a) i log |R| 

- ^ -(x 4 - Bx t _i - u) T Q _1 (x t - Bx 4 i - u) - - log |Q| 
i 2 i 2 

Note that in this case, Xo is no longer a realization of a random variable Xq; it is a fixed (but unknown) 
parameter. Equation [7] is written as if all the Xo are fixed, however when the general derivation is presented, 
it allowed that some Xo are fixed (A=0) and others are stochastic. 

If R is constant through time, then J £2 1 ^ log |R| in the likelihood equation reduces to ? log |R|, however 
sometimes one needs to includes time-dependent weighting on R@. The same applies to J2i \ 1°S IQI- 



All bolded elements are column vectors (lower case) and matrices (upper case). A T is the transpose 
of matrix A, A -1 is the inverse of A, and |A| is the determinant of A. Parameters are non-italic while 
elements that are slanted are realizations of a random variable (x and y are slatedfl 



1.3 Missing values 



In Shumway and Stoffer and other presentations of the EM algorithm for MARSS models (jShumwav and Stoffer 



2006t IZuur et al.l . 120031 ). the missing values case is treated separately from the non-missing values case. In 



6 If for example, one wanted to include a temporally dependent weighting on R replace |R| with |cttR| = where at 

is the weighting at time t and is fixed not estimated. 

7 In matrix algebra, a capitol bolded letter indicates a matrix. Unfortunately in statistics, the capitol letter convention is 
used for random variables. Fortunately, this derivation does not need to reference random variables except indirectly when 
using expectations. Thus, I use capitols to refer to matrices not random variables. The one exception is the reference to X and 
Y. In this case a bolded slanted capitol is used. 
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these derivations, a series of modifications are given for the EM update equations when there are missing val- 
ues. In my derivation, I present the missing values treatment differently, and there is only one set of update 
equations and these equations apply in both the missing values and non-missing values cases. My derivation 
does this by keeping E[K t |data] and Ep^X^jdata] in the update equations (much like E[X t |data] is kept 
in the equations) while Shumway and Stoffer replace these expectations involving Y t by their values, which 
depend on whether or not the data are a complete observation of Y t with no missing values. Section [5] shows 
how to compute the expectations involving Y t when the data are an incomplete observation of Y t ■ 

2 The EM algorithm 

The EM algorithm cycles iteratively between an expectation step (the integration in the equation) followed 
by a maximization step (the arg max in the equation): 

Qj+i = argmax / / log L(aj,y; Q)f(x, y\Y(l) = y(l), Qj)dxdy (8) 
e Jx Jy 

Y(l) indicates those Y that have an observation and y(l) are the actual observations. Note that G and 
Qj are different. If 6 consists of multiple parameters, we can also break this down into smaller steps. Let 
= {a, /3}, then 

a i+ i = argmax / / logL(a;,y, fij; a)f(x,y\Y(l) = y(l), a v /3j)dxdy (9) 
a JxJy 

Now the maximization is only over a, the part that appears after the ";" in the log-likelihood. 

Expectation step The integral that appears in equation © is an expectation. The first step in the EM 
algorithm is to compute this expectation. This will involve computing expectations like E[X t Xj\Y t (l) = 
y t (l),Qj] and E\YtX t |Y"t(l) = y t {l),@j]. The j subscript on 6 denotes that these are the parameters at 
iteration j of the algorithm. 

Maximization step: A new parameter set Oj+i is computed by finding the parameters that maximize 
the expected log-likelihood function (the part in the integral) with respect to 8. The equations that give the 
parameters for the next iteration (J + 1) are called the update equations and this report is devoted to the 
derivation of these update equations. 

After one iteration of the expectation and maximization steps, the cycle is then repeated. New expecta- 
tions are computed using 0j+i, and then a new set of parameters Qj+2 is generated. This cycle is continued 
until the likelihood no longer increases more than a specified tolerance level. This algorithm is guaranteed to 
increase in likelihood at each iteration (if it does not, it means there is an error in one's update equations). 
The algorithm must be started from an initial set of parameter values Qj.. The algorithm is not particularly 
sensitive to the initial conditions but the surface could definitely be multi-modal and have local maxima. 
See section [IT] on using Monte Carlo initialization to ensure that the global maximum is found. 

2.1 The expected log-likelihood function 

The function that is maximized in the "M" step is the expected value of the log-likelihood function. This 
expectation is conditioned on two things: 1) the observed Y's which are denoted Y(l) ana - which are equal to 
the fixed values y(l) and 2) the parameter set Qj. Note that since there may be missing values in the data, 
Y(l) can be a subset of Y, that is, only some Y have a corresponding y value at time t. Mathematically what 
we are doing is E-xY[g(X,Y)\Y(l) = y(l),0j]. This is a multivariate conditional expectation because X,Y is 
multivariate (amxiixT vector). The function <?(0) that we are taking the expectation of is log~L(Y , X : 0). 
Note that <?(0) is a random variable involving the random variables, X and Y, while logL(y,x; 8) is not a 
random variable but rather a specific value since y and x are a set of specific values. 

We denote this expected log- likelihood by The goal is to find the that maximize 9 and this becomes 
the new for the j + 1 iteration of the EM algorithm. The equations to compute the new are termed the 
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update equations. Using the log likelihood equation and expanding out all the terms, we can write out 
^ in verbose form as: 



E X ypogL(r,jr ; e);y(i)=w(i),e J ] = * = 

+ E[(ZX i ) T R" 1 ZX t ] + E[a T R -1 Zar t ] + E[(ZX t ) T R _1 a] + E[a T R _1 a]^ - ^ log |R| 



(10) 



- E[u T Q ^ - E[X7q- x u] + E[(BX*_ 1 ) T Q- 1 BX t _ 1 ] 

+ E^Q-^t-i] + E[(BX t _ 1 ) T Q- 1 u] + u T Q- 1 u^) - | log |Q 




olog|A| 




All the E[ ] appearing here denote Exy[.9()I^(1) = y(l)>©j]- I n the rest of the derivation, I drop the 
conditional and the XY subscript on E to remove clutter, but it is important to remember that whenever 
E appears, it refers to a specific conditional multivariate expectation. If Xo is treated as fixed, then Xq = £ 
and the last two lines involving A are dropped. 

Keep in mind that 8 and Qj are different. 9 is a parameter appearing in function g(X,Y, Q) (i.e. the 
parameters in equation^. X and Y are random variables which means that g(X, Y , O) is a random variable. 
We take the expectation of g(X, Y, 0), meaning we take integral over the joint distribution of X and Y. We 
need to specify what that distribution is and the conditioning on Qj (meaning the Qj appearing to the right 
of the | in ~E(g()\Qj)) is specifying this distribution. This conditioning affects the value of the expectation 
of g(X,Y, Q), but it does not affect the value of Q, which are the R, Q, u, etc. values on the right side of 
equation (|10p. We will first take the expectation of g(X,Y, 9) conditioned on Qj (using integration) and 
then take the differential of that expectation with respect to Q. 

2.2 The expectations used in the derivation 

The following expectations appear frequently in the update equations and are given special namet^): 



8 This notation is different than what you see in Shumway and Stoffer (2006), section 6.2. What I call Vt, they refer to as 
P™, and my Pt would be P™ + XjxJ in their notation. 



x t = E XY [jr t |y(i)=y(i),e J ] 

y t = E^\Y t \Y(l)=y(l),Q j ) 
P t =E XY [X t Xj\Y(l)=y(l),Q J } 
Pt,t-i = E X v[X t Xj_ 1 \Y(l)=y(l),Q j } 
V t = var XY {X t \Y(l) = y(l),Q 3 ] = P t -x t x7 
dt=E X v\VtYj\Y(l)=y(l),Q j ] 
W, = v a r XY [Y t \Y(l) = y(l),Qj] = Ot-y^ 
yx*= Ex Y \Y t Xj\Y(l)=y(l),Qj] 
yttt-x = E^[Y t Xj_ 1 \Y(l)=y(l),Q 3 ] 



T 



(11a) 
(lib) 
(11c) 
(lid) 
(lie) 
(llf) 

(Hg) 
(llh) 
(Hi) 



G 



Table 1: Notes on multivariate expectations. For the following examples, let X be a vector of length three, Xi, X2, X3. 
/() is the probability distribution function (pdf). C is a constant (not a random variable). 



E x [g(X)} = J J J g(x)f(xi,x 2 ,x 3 )dx 1 dx 2 dx 3 

E x [Xi] = J J J x 1 f(xi,x 2 ,x 3 )dxidx 2 dx 3 = J xif(xi)dxi = E[Xi] 

Ex[Xi + X 2 ] = Ex[Xi] + Ex[X 2 ] 

EjrjXi + C] = E X [X{\ +C 

Ex[CXi] = CEx[Xi] 

i: v XX x x 



The subscript on the expectation, E, denotes that this is a multivariate expectation taken over X and Y. The 
right sides of equations (|lle|) and ( |llg[ ) arise from the computational formula for variance and covariance: 

var[X] = E[XX T ] - E[X] E[X} T (12) 
cov[X,Y] = E[XY T ] - E[X]E[Y] T . (13) 

Section [6] shows how to compute the expectations in equation 1111 



3 The unconstrained update equations 

In this section, I show the derivation of the update equations when all elements of a parameter matrix are 
estimated and are all al lowed to be different, i .e. th e unconstrained case. These are similar to the update 
equations one will see in Shumwav and Stoffer ( 20061 ). Section [5] shows the update equations when there are 



unestimated (fixed) or estimated but shared values in the parameter matrices, i.e. the constrained update 
equations. 

To derive the update equations, one must find the 6, where is comprised of the MARSS parameters 
B, u, Q, Z, a, R, £, and A, that maximizes VP (equation [TO]) by partial differentiation of "J with respect to 
0. However, I will be using the EM equation where one maximizes each parameter matrix in one-by-one 
(equation [9]) . In this case, the parameters that are not being maximized are set at their iteration j values, 
and then one takes the derivative of ^ with respect to the parameter of interest. Then solve for the parameter 
value that sets the partial derivative to zero. The partial differentiation is with respect to each individual 
parameter element, for example each u^j in matrix u. The idea is to single out those terms in equation (fT0|) 
that involve Uij (say), differentiate by Uij, set this to zero and solve for Ui.j. This gives the new Uij that 
maximizes the partial derivative with respect to mj of the expected log-likelihood. Matrix calculus gives us 
a way to jointly maximize \& with respect to all elements (not just element in a parameter matrix. 



3.1 Matrix calculus need for the derivation 

Before commencing, some definitions from matrix calculus will be needed. The partial derivative of a scalar 
(vp is a scalar) with respect to some column vector b (which has elements b\, b 2 . . .) is 



'a* 

dbi 



9* 
db 2 



db n 
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Note that the derivative of a column vector b is a row vector. The partial derivatives of a scalar with respect 
to some n x n matrix B is 



dB 



r a* 


a* 






db 2 ,i 


db nA 


a* 




9* 


dbi i2 


db 2 ,2 


db n , 2 








-dbi, n 


db 2 , n 





Note that the indexing is interchanged; d^/dbi j — yd^/do^ . „ For Q and R, this is unimportant because 
they are variance- covariance matrices and are symmetric. For B and Z, one must be careful because these 
may not be symmetric. 

A number of derivatives of a scalar with respect to vectors and matrices will be needed in the derivation 
and are shown in table [5J In the table, both the vectorized and non-vectorized versions are shown. The 
vectorized version of a matrix D with dimension nxmis 



vec(D Ui . 



d n ,i 
di, 2 

d n .i 

dl, m 

dn. _ m. 



3.2 The update equation for u (unconstrained) 

Take the partial derivative of ^ with respect to u, which is a to x 1 matrix. All parameters other than u are 
fixed to constant values (because partial derivation is being done). Since the derivative of a constant is 0, 
terms not involving u will equal and drop out. Taking the derivative to equation (|10l) with respect to u: 

1 T / 

f=i V (20) 
+ d( E[(BX t _i) T Q- 1 u])/0u + d( E[u T Q _1 BJf t _i])/»u + 0(u T Q -1 u)/0uJ 

The parameters can be moved out of the expectations and then the matrix derivative relations (table [5]) are 
used to take the derivative. 

df/du = -I W - ElXtfQ- 1 - ElXtfQ- 1 + (B E[X t _ 1 ]) T Q- 1 + (B E[X t _ 1 ]) T Q- 1 + 2u T Q-^ 

(21) 
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Table 2: Derivatives of a scalar with respect to vectors and matrices. In the following a and c are n x 1 column 
vectors, b and d are m x 1 column vectors, D is a n x m matrix, Cisanxn matrix, and A is a diagonal n x n 
matrix (Os on the off-diagonals). C _1 is the inverse of C, C T is the transpose of C, CT T = (CT 1 ) = (C T ) , 
and |C| is the determinant of C. Note, all the numerators in the differentials reduce to scalars. Although the matrix 
names may be the same as in the text, these matrices are dummy matrices to show the matrix derivative relations. 

<9(a T c) /da = <9(c T a) /da = c T (14) 
<9(a T Db)/<9D = <9(b T D T a)/<9D = ba T 



d(a T Db)/<9vcc(D) = <9(b T D T a)/dvec(D) = (vec(ba T )) 



T 



<9(log |C|)/0C = -d(log IC" 1 |)/6»C = (C T )- 1 = C~ 
5(log|C|)/avcc(C) = (vcc(C- T )) T 

a(b T D T CDd)/o>D = db T D T C + bd T D T C T 

a(b T D T CDd)/9vcc(D) = ( vcc(db T D T C + bd T D T C T )) T (17) 
If b = d and C is symmetric then the sum reduces to 2bb T D T C 

<9(a T Ca)/da = d(aC T a T )/da= 2a T C (18) 

9(a T C" 1 c)/9C = C^a^C 1 
9(a T C- 1 c)/9vcc(C) = -( vcc(C- 1 ac T C- 1 )) T 



(15) 
(16) 



9 



This also uses Q 1 = (Q 1 ) T . This can then be reduced to 

T 

dy/du = J2(V[Xt} T Q~ 1 - E[X t _i] T B T Q _1 - u T Q *) (22) 
t=\ 

Set the left side to zero (a p x m matrix of zeros) and transpose the whole equation. Q _1 cancels ou10 by 
multiplying on the left by Q (left since the whole equation was just transposed), giving 

T T 

= ( E[Xt} - B E[Jr t _!] - u) = ( E[X t ] - B E[X t -i}) - u (23) 
t=i t=i 

Solving for u and replacing the expectations with their names from equation II 11 gives us the new u that 
maximizes 

1 T 

"3+1 = J? E ( X * - BX '-!) ( 24 ) 
i=l 

3.3 The update equation for B (unconstrained) 

Take the derivative of $ with respect to B. Terms not involving B, equal and drop out. I have put the 
E outside the partials by noting that d( E[h(X t , B)])/<9B = E[d(h(X t , B))/c?B] since the expectation is 



conditioned on Bj not B. 



1 T ( 



(25) 



T 

2 

- E[d((BX t _ 1 ) T Q- 1 X t )/dB}+ E[a((BX t _ 1 ) T Q- 1 (BX 4 _ 1 ))/9B] 
+ E[a((BX t _ 1 ) T Q- 1 u)/<9B] + E[a(u T Q- 1 BX t _i)/aB] > 

1 T / 

= "2 E ( - E^^Q-^t-xD/aB] 

- E[9(X7_ 1 B T Q- 1 Xt)/9B]+ E[9(X 4 T _ 1 B T Q- 1 (BA: f _ 1 ))/5B] 
+ E[9(X t T _ 1 B T Q- 1 u)/9B] + E[0(u T Q -1 R2r t _i)/flBj] 

After pulling the constants out of the expectations, we use relations (|15|) and p7|) to take the derivative and 
note that Q 1 = (Q _1 ) T : 

T 

d^/dB = -±J2 ( - E[X t ^ 1 Xj]Q- 1 - E[X t ^ 1 Xj]Q- 1 

t=i V (26) 

2E[X t _ 1 x7_ 1 ]B T Q- 1 + E[JT t _i]u T Q- 1 + E^-il^Q- 1 ' 



This can be reduced to 

/■ 

2 



9*/9B = -i £ C - 2E[X t _ 1 X t T ]Q- 1 + 2E[X t _ 1 X t T _ 1 ]B T Q- 1 + 2E[X t _ 1 ]u T Q- 1 ") (27) 



3 Q is a variance-covariance matrix and is invertible. Q X Q = I, the identity matrix. 
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Set the left side to zero (an m x m matrix of zeros), cancel out Q 
rid of the -1/2, and transpose the whole equation to give 



by multiplying by Q on the right, get 







T 

E 

T 

= E 

t=i 



(E^l-BE^^-uE^]) 



BPt i - iTx^ x ) 



(28) 



The last line replaced the expectations with their names shown in equation (fTTj) . Solving for B and noting 
that Pt_i is like a variance-covariance matrix and is invertible, gives us the new B that maximizes "J, 



s j+i = 



u T x 



t-1, 



(29) 



Because all the equations above also apply to block-diagonal matrices, the derivation immediately gen- 
eralizes to the case where B is an unconstrained block diagonal matrix: 



B 



1,1 


h,2 


&1,3 

















2,1 




&2,3 

















3,1 


&3,2 


h.3 


























hA 


h,5 




















h,4 


h,5 

















Q 








&6,6 


be,7 


^6,8 

















b7.6 


b7,7 


^7,8 

















h,e 


bs,7 


^8,8 



Bi 








B 2 







B 3 



For the block diagonal B, 



Bi> , +1 = ^(P^-u^-O E ? 



(30) 



where the subscript i means to take the parts of the matrices that are analogous to B^; take the whole part 
within the parentheses not the individual matrices inside the parentheses. If B^ is comprised of rows a to b 
and columns c to d of matrix B, then take rows a to b and columns c to d of the matrices subscripted by i 
in equation (l30l) . 



3.4 The update equation for Q (unconstrained) 

The usual way to do this derivation is to use what is known as the "trace trick" which will pull the Q _1 
out to the left of the c T Q _1 b terms which appear in the likelihood (fTU|) . Here I'm showing a less elegant 
derivation that plods step by step through each of the likelihood terms. Take the derivative of VP with respect 
to Q. Terms not involving Q equal and drop out. Again the expectations are placed outside the partials 
by noting that 8(E[h{X u Q)])/8Q= E[0(fc(JfT t ,Q))/0Q]. 

d*/dQ = (EldiXjQ^X^/dQ] - E[5(X t T Q- 1 BX t _ 1 )/9Q] 

1 t=i v 

- E[a((BX t _ 1 ) T Q- 1 X 4 )/3Q] - E[9(x7q- x u)/9Q] 

- E[5(u T Q- 1 X 4 )/9Q] + E[«9((BX t _ 1 ) T Q- 1 BX 4 _ 1 )/9Q] ( 31 ) 
+ E[a((BX t _ 1 ) T Q- 1 u)/9Q] + E[9(u T Q- 1 BX i _ 1 )/9Q] 

+ d(u T q- 1 u)/dQ\ -d(^]og\Q\\/dQ 
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The relations (|19p and (|16l) are used to do the differentiation. Notice that all the terms in the summation 
are of the form c T Q *b, and thus after differentiation, all the c T b terms can be grouped inside one set of 
parentheses. Also there is a minus that comes from equation (|19|) and it cancels out the minus in front of 
the initial —1/2. 



d*/dQ - Ij^Q-'t E[X t Xj] - EX,[BX t ,)'] - E[BX t ^Xj] - E[X t u T ] - E[uXj] 

z t=i \ 

+ E[BX t _ 1 (BX t _ 1 ) T ] + E[BX t _ lU T ] + E[u(BI M ) T ] + uu^CT 1 - ^CT 1 

Pulling the parameters out of the expectations and using (BXt) T = XjB T , we have 
1 T / 

0tf/0Q = x^TCT 1 E[X t Xj]- E[X t x7_ 1 ]B T -BE[X t _ 1 X t T ]- E[X t ]u T - uE[Xj 



(32) 



+ B E[X t - 1 Xj_ 1 ]B T + B E[X t ^}u T + uE[x7_ 1 ]B T + uu T J Q 1 - ^CT 1 
The partial derivative is then rewritten in terms of the Kalman smoother output: 

T 



(33) 



d^/dQ =^J2Q~ 1 ( Pt ~ P t ,t-iB T - BP t _! 



t=i 



t - x t u T - ux 



+ BP t : B +Bx 4 iu +ux t _ 1 B l +uu Q 



T 



(34) 



Q 



Setting this to zero (a m x m matrix of zeros), Q 1 is canceled out by multiplying by Q twice, once on the 
left and once on the right and the 1/2 is removed: 



TQ = ( P * - P M i bT " BP t _i >t - S f u T - ux 4 T + BP t iB T + BX f lU T + ux^ ^B 1 



(35) 



This gives us the new Q that maximizes 

T 



t B -BP t _ M -x t u' -ux t ' 



+ BP t iB T + Bx ( _iu T + ux t r _ 1 B T + uu 
This derivation immediately generalizes to the case where Q is a block diagonal matrix: 



(36) 



Q 



In this case, 





91,2 


91,3 














" 


91,2 


92,2 


92,3 

















91,3 


92,3 


93,3 


























94,4 


94,5 




















94.5 


95,5 














Q 











96,6 


96,7 


96,8 

















96,7 


97,7 


97,8 

















96,8 


97,8 


98,8_ 




















fl = 


1 T 


(*■- 


P*,t- 


iB T 


BP t 








t=l 












+ 


BP 4 


-iB T 


f BX t 


1U T 


+ ux t 


-i pT 


+ UU 



Qi 
Q 2 







Q 3 



-T 



(37) 
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where the subscript i means take the elements of the matrix (in the big parentheses) that are analogous to 
take the whole part within the parentheses not the individual matrices inside the parentheses). If Q { 
is comprised of rows a to 6 and columns c to d of matrix Q, then take rows a to 6 and columns c to d of 
matrices subscripted by i in equation (|57| . 

By the way, Q is never really unconstrained since it is a variance-covariance matrix and the upper 
and lower triangles are shared. However, because the shared values ar e only the symmetric values in the 



matrix, the derivation still works even though it's technically incorrect (jHenderson and Searle . 1979). The 



constrained update equation for Q shown in section 15.81 explicitly deals with the shared lower and upper 
triangles. 

3.5 Update equation for a (unconstrained) 

Take the derivative of ^ with respect to a, where a is a n x 1 matrix. Terms not involving a, equal and 
drop out. 



<9*/<9a= -\Y1 ( -3(E[y t T R" 1 a])/9a-a(E[a T R- 1 F t ])/5a 

' ^ ' (38) 



1=1 



+ d{ E[(ZX t ) T R _1 a])/aa + d( E[a T R _1 ZX t ])/0a + d( E[a T R _1 a])/0a 

The expectations around constants can be dropped^. Using relations (fl"4"|) and (fT8"|) and using R 1 = 
(R _1 ) T , we have then 



d^/da = -i ( ~ Ep^R -1 ] - EfF^R -1 ] + E[(ZX t ) T R _1 ] + E[(ZA" t ) T R _1 ] + 2a T R~ lN ) (39) 



t=i 



Pull the parameters out of the expectations, use (ab) T = b T a T and R 1 = (R 1 ) T where needed, and 
remove the —1/2 to get 



8^ /da = I E[F t ] T R _1 - E[Xj] T Z T R- 1 - a T R 1 J (40) 
t=i ^ ' 

Set the left side to zero (a 1 X n matrix of zeros), take the transpose, and cancel out R _1 by multiplying by 
R, giving 

T T 

= J2( HYt] - Z E[X t ] - a) = Y, (Yt ~ 2% - a) (41) 
t=i t=i 

Solving for a gives us the update equation for a: 



1 T 



T 

t=i 



because Exy(C) = O, where C is a constant. 
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3.6 The update equation for Z (unconstrained) 

Take the derivative of \& with respect to Z. Terms not involving Z, equal and drop out. The expectations 
around terms involving only constants have been dropped. 

d^/dZ = (note dZ ismxn while Z is n x m) 
i T / 

-J2 ( - e[9(f7r- 1 zxo/9z] - E[a((zx t ) T R^ 1 y t )/9z] + E[0((zx t ) T R _1 z*t)/0z] 

(43) 



2 

t=i x 

E[0((ZX t ) T R _1 a)/0Z] + E[9(a T R~ 1 ZX t )/5Z 

T 

2 



J V ( - Epor^R^zXtj/sz] - E[d(xJz T K~ 1 Y t )/dz} + E[d(xJz r R- 1 zx t )/dz] 

2 4=1 V 

E[a(x7z T R- 1 a)/az] + E[0(a T R -1 ZJT t )/0Z] 



Using the matrix derivative relations (tabled]) and using R 1 = (R 1 ) T , we get 

T 

2 



(44) 



d^/dz = -^ J2 (- E^y^rVE^t^r 1 ] 

+ 2E[X t X 4 T Z T R~ 1 ] + E[Jr t -ia T R _1 ] + E[Jr t a T R _1 ] 
Pulling the parameters out of the expectations and getting rid of the —1/2, we have 

<9*/<9Z = ( E[A" t y7]R _1 - E[X f X f T ]Z T R" 1 - E[X 4 ]a T R~ lN ) (45) 



Set the left side to zero (a m x n matrix of zeros) , transpose it all, and cancel out R by multiplying by R 
on the left, to give 

T T 

= ]T (E[Y t Xj] - ZE[M t T ] - aE[X t T ]) = £ (y* t - ZP t - e&J) (46) 
t=i 4=1 

Solving for Z and noting that Pt is invertible, gives us the new Z: 

Z J+1 = (^(ylc t -ax t T )V]rp t ) (47) 

3.7 The update equation for R (unconstrained) 

Take the derivative of "J with respect to R. Terms not involving R, equal and drop out. The expectations 
around terms involving constants have been removed. 



T 

2 ■ 

4=1 

E[d(YjR- 1 a)/dR] - E[0(a T R -1 Y t )/aR] + E[d((ZXt) T Br l ZX t )/8R] ( 48 ) 



1 T ( 

d^/dK=--J2 EidiYjR^Y^/dR] - E[d{Y] R^ZX^/dR] - E[d({ZX t ) T B.- 1 Y t )/dR] 



E[0((ZJT t ) T R -1 a)/aR] + E[0(a T R _1 ZX t )/0R] + 5(a T R _1 a)/5R ) - log |R|)/SR 
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We use relations ([TO)) and (TTB1) to do the differentiation. Notice that all the terms in the summation are of 
the form c T R _1 b, and thus after differentiation, we group all the c T b inside one set of parentheses. Also 
there is a minus that comes from equation (|19p and cancels out the minus in front of —1/2. 



1 T ( 

d*/dR= -^Rr 1 E[y t F7] - V[Y t (ZX t ) T ] - E[ZX t Yj] - E[F f a T ] - E[aYj] 



T 

2 

t=i 

+ E[ZX t (ZX t ) T ] + E[ZJ t a T ] + E[a(ZX t ) T ] +aa T ^R _1 - -^R" 1 
Pulling the parameters out of the expectations and using (ZY t ) T — Y t Z T , we have 

9*/9R= - ^R-^E[r t F t T ] - E[Y t Xj]Z T - ZE[X t Yj] - E[y 4 ]a T — aE[Y"J] 

+ Z E[X t Xj]Z T + ZE[I,]a T + aE[Xj]Z T + aa^R" 1 - ^R" 1 

We rewrite the partial derivative in terms of expectations: 

1 T f 

OV/dR = - R 1 [Ot ~ 3^ t Z T ~ ZJ^7 - y t a T - ayj 
+ ZP t Z T + ZX t a T + aX f r Z T + aa T J R 1 - -^R" 1 



(49) 



(50) 



(51) 



Setting this to zero (a n x n matrix of zeros), we cancel out R 1 by multiplying by R twice, once on the left 
and once on the right, and get rid of the 1/2. 

TR = ^ fo t - yx t Z T - Z$Zj - y f a T - ay t T + ZP f Z T + ZX t a T + aiJZ T + aa T ') (52) 

We can then solve for R, giving us the new R that maximizes 

r j+i = 4 (°* - ^ zT - Z ^* T - ^ aT - a ^ + z p* zT + z ** aT + a ^ zT + aaT ) ( 53 ) 

As with Q, this derivation immediately generalizes to a block diagonal matrix: 

R 



Ri 
R 2 
R 3 



In this case, 

/ 

T 



R^j+i = ^ ( °* - y x * zT - z y^ - y* aT - a ^ + ZP * zT + z * taT + a ^ zT + aaT ) ( 54 ) 



where the subscript i means we take the elements in the matrix in the big parentheses that are analogous to 
Ri. If Ri is comprised of rows a to b and columns c to d of matrix R, then we take rows a to 6 and columns 
c to d of matrix subscripted by i in equation (|54l) . 
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3.8 Update equation for £ and A (unconstrained), stochastic initial state 

Shumwav and Stofferl ( 20061 ) and iGhahramani and Hintonl (|l996f ) imply in their discussion of the EM algo- 



rithm that both £ and A can be estimated (though not simultaneously). Harvey (1989), however, discusses 
that there are only two allowable cases: x$ is treated as fixed (A = 0) and equal to the unknown parameter 
£ or Xq is treated as stochastic with a known mean £ and variance A. For completeness, we show here the 
update equation in the case of Xq stochastic with unknown mean £ and variance A (a case that Harvey 
(1989) says is not consistent). 

We proceed as before and solve for the new £ by minimizing Take the derivative of "J with respect to 
£ . Terms not involving £, equal and drop out. 

^/^ = -i(-a(EK T A- 1 Xo])/^-a(E[X T A- 1 ^])/^ + ^ T A- 1 0/5|) (55) 

Using relations (fl4|) and (|18|) and using A -1 = (A _1 ) T , we have 

3®ldt = -\{- E^A- 1 ]- E[X ( [A- 1 ]+2C T A- 1 ) (56) 
Pulling the parameters out of the expectations, we get 

dV/dt =-\{- 2 E t*o ]A" X + 2£ T A- X ) (57) 

We then set the left side to zero, take the transpose, and cancel out —1/2 and A^ 1 (by noting that it is a 
variance-covariance matrix and is invertible). 

= (A" 1 E[X ] + A" 1 *) = (x - €) (58) 

Thus, 

ij+i = x o (59) 

xo is the expected value of Xq conditioned on the data from t = 1 to T, which comes from the Kalman 
smoother recursions with initial conditions defined as E[-X"o|Y"o = j/ ] = £ and var(-X"o-Xo |Ko = y ) = A. A 
similar set of steps gets us to the update equation for A, 

A J+ i = V (60) 

Vo is the variance of Xo conditioned on the data from t = 1 to T and is an output from the Kalman smoother 
recursions. 

If the initial state is defined as at t = 1 instead of t = 0, the update equation is derived in an identical 
fashion and the update equation is similar: 

€ i+ i=xi (61) 
A i+ i - Vi (62) 

These are output from the Kalman smoother recursions with initial conditions defined as E[Xi|Y"o = y ] = £ 
and vai(XiX 1 \Yq = y ) = A. Notice that the recursions are initialized slightly differently; you will see the 
Kalman filter and smoother equations presented with both types of initializations depending on whether the 
author defines the initial state at t = or t = 1. 

3.9 Update equation for £ (unconstrained), fixed x 

For the case where Xq is treated as fixed, i.e. as another parameter, then there is no A, and we need to 
maximize d^/d^ using the slightly different ^ shown in equation ([?])• Now £ appears in the state equation 
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part of the likelihood. 

d9/de=-±(-E[d(XjQ- 1 BQ/dt] - E[3((B^) T Q- 1 X 1 )/^] + E[5((B^) T Q- 1 (B^))/^] 

+ E[9((B|) T Q- 1 u)/^] + Eia^Q-^O/^]) 

1 / (63) 
= --[- E[9(X7Q- x BO/^] - E[3(| T B T Q- 1 X 1 )/^] + E[a(| T B T Q- 1 (BO)/5^] 



2 

+ E[a(| T B T Q- 1 u)/a^] + E[9(u T Q- 1 B^)/^; 
After pulling the constants out of the expectations, we use relations (fT5|) and (|T7|) to take the derivative: 

dV/d£ = -\{^~ E[X 1 ] T Q- 1 B - E[X 1 ] T Q- 1 B + 2£ T B T Q _1 B + u T Q X B + u T Q- 1 B^ (64) 
This can be reduced to 

d^>/d£ = E[Xi] T Q _1 B - £ T B T Q -1 B - u T Q *B (65) 

To solve for £, set the left side to zero (an m x 1 matrix of zeros), transpose the whole equation, and then 
cancel out B T Q X B by multiplying by its inverse on the left, and solve for £. This step requires that this 
inverse exists. 

£ = (B T Q- 1 B)- 1 B T Q- 1 (E[X 1 ] - u) (66) 

Thus, in terms of the Kalman filter/smoother output the new £ for EM iteration j + 1 is 

€ J+1 = (B T Q- 1 B)- 1 B T Q- 1 (X 1 - u) (67) 

Note that using, Xo output from the Kalman smoother would not work since A = 0. As a result, = £j 
in the EM algorithm, and it is impossible to move away from your starting condition for 

This is conceptually similar to using a generalized least squares estimate of £ to concentrate it out of the 
likelihood as discussed in Harvey (1989), section 3.4.4. However, in the context of the EM algorithm, dealing 
with the fixed Xo case requires nothing special; one simply takes care to use the likelihood for the case where 
Xo is treated as an unknown parameter (equation [7]) . For the other parameters, the update equations are 
the same whether one uses the log-likelihood equation with Xo treated as stochastic (equation [6]) or fixed 
(equation [7]) . 

If your MARSS model is stationarj{3 and your data appear stationary, however, equation (|6"6")l probably 
is not what you want to use. The estimate of £ will be the maximum-likelihood value, but it will not be 
drawn from the stationary distribution; instead it could be some wildly different value that happens to give 
the maximum-likelihood. If you are modeling the data as stationary, then you should probably assume that 
£ is drawn from the stationary distribution of the X's, which is some function of your model parameters. 
This would mean that the model parameters would enter the part of the likelihood that involves £ and A. 
Since you probably don't want to do that (if might start to get circular) , you might try an iterative process 
to get decent £ and A or try fixing £ and estimating A (above). You can fix £ at, say, zero, by making sure 
the model you fit has a stationary distribution with mean zero. You might also need to demean your data 
(or estimate the a term to account for non-zero mean data). A second approach is to estimate X\ as the 
initial state instead of Xq. 



meaning the X's have a stationary distribution 
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3.10 Update equation for £ (unconstrained), fixed X\ 

In some cases, the estimate of Xq from X\ using equation [57] will be highly sensitive to small changes in the 
parameters. This is particularly the case for certain B matrices, even if they are stationary. The result is 
that your £ estimate is wildly different from the data at t = 1. The estimates are correct given how you 
defined the model, just not realistic given the data. In this case, you can specify £ as being the value of x 
at t = 1 instead of t = 0. That way, the data at t — 1 will constrain the estimated £. In this case, we treat 
Xi as fixed but unknown parameter £. The likelihood is then: 



T T 

logL(j/,x; 9) = -(y t - Zx t - a) T R- 1 ( !/t - Zx t - a) i log |R| 
i i 

T T 

- X! o( x * ~ Bx *-i ~ u) T Q _1 (x t - Bx t i - u) - 2^ - log |Q| 



wt/dt = -\(- ndiYjR-'z^/dt} - E[a((z^) T R- 1 y 1 )/^] + e[s((z^) t r- x (zo)/^] 



2 

E[a((Z^) T R- 1 a)/3|] + E[9(a T R- 1 Z|)/^] 

if- ndiXlQ-'m/m - nd((BO T Q- 1 X 2 )/d^] + E^B^O,- 1 ^))/^] 



2 



(69) 



+ E[0((B£) T Q _1 u)/0£] + E[5(u T Q- 1 B£)/ae] 

Note that the second summation starts at t = 2 and £ is X\ instead of Xq. 

After pulling the constants out of the expectations, we use relations (|15[) and (jTTJ) to take the derivative: 



d^/d£ = -~ f - E[Fi] T R" 1 Z - E[F 1 ] T R" 1 Z + 2£ T Z T R _1 Z + a T R X Z + a T R _1 Z^ 
if- E[X 2 ] T Q _1 B E[X 2 ] T Q _1 B + 2| T B T Q 1 B + u T Q X B + u T Q _1 B 



(70) 



2 

This can be reduced to 

d^/di = E[y 1 ] T R" 1 Z - £ T Z T R *Z - a T R X Z + E[X 2 ] T Q~ 1 B - £ T B T Q _1 B - u T Q X B 
= -£ T (Z T R _1 Z + B T Q~ 1 B) + E[Fi] T R" 1 Z - a T R X Z + E[X 2 ] T Q J B - u T Q _1 B 



(71) 



To solve for £, set the left side to zero (an m X 1 matrix of zeros), transpose the whole equation, and solve 
for £. 

^ = (Z T R- 1 Z + B T Q^ 1 B)^ 1 (Z T R- 1 (E[Fi] -a)+B T Q- 1 (E[X 2 ] -u)) (72) 
Thus, when £ = x\, the new £ for EM iteration j + 1 is 

= (Z T R- X Z + B T Q- 1 B)- 1 (Z T R- 1 (y 1 - a) + B T Q ^(x, - u)) (73) 

4 The time-varying MARSS model with linear constraints 

The first part of this report dealt with the case of a MARSS model (equation [1]) where the parameters are 
time-constant and where all the elements in a parameter matrix are estimated with no constraints. I will 
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now describe the derivation of an EM algorithm to solve a much more general MARSS model (equation 
[74)) . which is a time- varying MARSS model where the MARSS parameter matrices are written as a linear 
equation f + Dm. This is a very general form of a MARSS model, of which many (most) multivariate 
autoregressive Gaussian models are a special case. This general MARSS model includes as special cases, 
MARSS models with covariates (many VARSS models with exogeneous variables), multivariate AR lag-p 
models and multivariate moving average models, and MARSS models with linear constraints placed on the 
elements within the model parameters. The objective is to derive one EM algorithm for the whole class, 
thus a uniform approach to fitting these models. 
The time- varying MARSS model is written: 

x t = B t x t i + u t + H t w t , where W t ~ MVN(0, Q t ) (74a) 

y t = Z t x t + a t + G t v t , where V t ~ MVN(0, R t ) (74b) 

x to = £ + PI, where t = or t = 1 (74c) 

L - MVN(0, A) (74d) 



w, 



MVN(0,£), £ = 



Q t 
R t 



(74e) 



This looks quite similar to the previous non-time varying MARSS model, but now the model parameters, 
B, u, Q, Z, a and R, have a t subscript and we have a multiplier matrix on the error terms v t , w t , 1. The 
H t multiplier is m x s, so we now have s state errors instead of m. The G t multiplier is n x k, so we now 
have k observation errors instead of n. The F multiplier is m x j, so now we can have some initial states 
(j of them) be stochastic and others be fixed. I assume that appropriate constraints are put on G and H 
so that the resulting MARSS model is not under- or over-const r ained 12 ! . The notation/pres entation here 
was influenced by SJ Koopman's work, esp. iKoopman and Ooms ( 201 it ) and Koopman ( 19931 ). but in these 



works, Q t and Rt equal I and the variance-covariance structures are instead specified only by H t and Gt. I 
keep Q t and Rt in my formulation as it seems more intuitive (to me) in the context of the EM algorithm 
and the required joint-likelihood function. 

We can rewrite this MARSS model using vec relationships (table [3]): 

x t = {xj_ x ®I m )vec(B t ) + vec(ut) +H t w t ,W t ~ MVN(0,Q t ) 
y t = (xj <g> I n ) vec(Zt) + vec(a t ) + G t v t , V t ~ MVN(0, R t ) (75) 
x to =£ + Fl,L~ MVN(0,A) 

Each model parameter, B t , u t , Q t , Z t , a t , and R t , is written as a time-varying linear model, ft + D t m, 
where f and D are fully-known (not estimated and no missing values) and m is a column vector of the 
estimates elements of the parameter matrix: 



vec(Bt) 


= ft,b 


f Dt, b ^ 


vec(u t ) 


— ft,u 


+ D tjU t; 


vec(Q t ) 


= u. q 


+ D t „q 


vec(Zt) 


= f M 


+ Dt,.C 


vec(a t ) 


= ft, a 


+ D tia a 


vec(Rt) 


= ft,r 


+ D t , r r 


vec(A) 


= fA + 


D A A 


vec(£) 


= h + 





(76) 



The estimated parameters are now the column vectors, [3, v, q, £, a, r, p and A. The time- varying aspect 
comes from the time- varying f and D. Note that variance-covariance matrices must be positive-definite 



"For example, if both G and H are column vectors, then the system is over-constrained and has no solution. 
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and we cannot specify a form that cannot be estimated. Fixing the diagonal terms and estimating the 
off-diagonals would not be allowed. Thus the f and D terms for Q, R and A are limited. For the other 
parameters, the forms are fairly unrestricted, except that the Ds need to be full rank so that we are not 
specifying an under-constrained model. 'Full rank' will imply that we are not trying to estimate confounded 
matrix elements; for example, trying to estimate a\ and ai but only a\ + a 2 appear in the model. 

The temporally variable MARSS model, equation ([75)1 together with ([751) . l°°ks rather different than 
other temporally variable MARSS models, such as a VARSSX or MARSS with covariates model, in the 
literature. But those models are special cases of this equation. By deriving an EM algorithm for this more 
general (if unfamiliar) form, I then have an algorithm for many different types of time-varying MARSS 
models with linear constraints on the parameter elements. Below I show some examples. 



4.1 MARSS model with linear constraints 

We can use equation (|75p to put linear constraints on the elements of the parameters, B, u, Q, Z, a, R, £ 
and A. Here is an example of a simple MARSS model with linear constraints: 



a 
2a 



t-i 



W'l 

w 2 



Wi 
W 2 



MVN 



0.1 

u + 0.1 



9n 

921 



912 
922 



1)1 






c 








c 


Vs_ 


t 






-e + 2 




Vl 








Vl 




- MVN 




. V 3_ 


t 





3c 



2d 
d 

e 



r 
2r 







4r 



MVN 



1 
1 



Linear constraints mean that elements of a matrix may be fixed to a specific numerical value or specified as 
a linear combination of values (which can be shared within a matrix but not shared between matrices). 

Let's say we have some parameter matrix M (here M could be any of the parameters in the MARSS 
model) where each matrix element is written as a linear model of some potentially shared values: 



M = 



f 2c + 
-1.2 




0.9 c 
a 
3c + 1 b 



Thus each i-th element in M can be written as /3j + /3 a .;a + f3b.ib + (3 c ,iC, which is a linear combination of three 
estimated values a, b and c. The matrix M can be rewritten in terms of a Pi part and the part involving the 



M = 



2 

-1.2 




0.9 

1 



2c 





a 
3c 



= M flxod + M 



free 
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The vec function turns any matrix into a column vector by stacking the columns on top of each other. Thus, 



+ 2c + 
-1.2 


0.9 

vec(M) = a 

3c + 1 
c 

b 

We can now write vec(M) as a linear combination of f = vec(Mfj xcc i) and Dm = vec(Mf rcc ). m is a p x 1 
column vector of the p free values, in this case p = 3 and the free values are a, b, c. D is a design matrix that 
translates m into vec(Mf rcc ). For example, 



vec(M) 



'a + 2c + 2 









"1 


2 


0" 


-1.2 




-1.2 


















2 













0.9 




0.9 













a 







+ 


1 








3c + 1 




1 










3 


c 















1 





















b 












1 






f + Dm 



There are constraints on D. Your D matrix needs to describe a solvable linear set of equations. Basically 
it needs to be full rank (rank p where p is the number of columns in D or free values you are trying to 
estimate), so that you can estimate each of the p free values. For example, if a + b always appeared together, 
then a + b can be estimated but not a and b separately. Note, if M is fixed, then D is undefined but that is 
fine because in this case, there will be no update equation needed; you just use the fixed value of M in the 
algorithm. 



4.2 A MARSS model with exogenous variables 

The following is a commonly seen MARSS model with covariates g t and h t appearing as additive elements: 

x t = Bxt_i + Cg 4 + w t 
y t = Zx t + Fh t + v ( 

We would typically want to estimate C or F which are the influence of our covariates on our responses, x or 
y. Let's say there are p covariates in h t and q covariates in g 4 . Then we can write the above in vec form: 

x t = (xj_i ® I m ) vec(B) + (h t T <g) lp) vec(C) + w t 
y t = (xt <g> I n ) vec(Z) + (g t ! <g> I g ) vec(D) + v t 

Let's say we put no constraints B, Z, Q, R, £, or A. Then in the form of equation (1751) . 

x t = (xj_ ± I m ) vec(B t ) + vec(u t ) + w t 
y t = (xj ® I n ) vec(Zt) + vec(a t ) + v t , 
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Table 3: Kronecker and vec relations. Here A is n x m, B is m x p, C is p x q, and E and D are p x p. aisamxl 
column vector and b is a p x 1 column vector. The symbol <g> stands for the Kronecker product: A ® C is a np x mg 
matrix. The identity matrix, I„, is a n x n diagonal matrix with ones on the diagonal. 



vec(a) = vec(a T ) = a 

The vec of a column vector (or its transpose) is itself. (77) 
a= (a T ®Ii) 

vec(Aa) = (a T ® I„) vec(A) = Aa . . 

vec(Aa) = Aa since Aa is itself an to x 1 column vector. 

vec(AB) = (I p A) vcc(B) = (B T ® I„) vec(A) (79) 

vec(ABC) = (C T ® A) vec(B) (80) 

(A ® B)(C (gi D) = (AC ® BD) (81) 



(a®I p )C = (a® C) 

C(a T ® I,) = (a T ® C) 

E(a T (g) D) = ED(a T ® I p ) = (a T ® ED) 



(82) 



(acg)I p )C(b T = (ab T ®C) (83) 

(84) 

(A T ®B T ) = (A®B) T (85) 



(a <E> a) = vcc(aa T ) 

(a T (g) a T ) = (aig>a) T = (vcc(aa T )) T 
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with the parameters defined as follows: 



vec(B t ) 


= ft,6 + D tiW 8; f t , 6 = 


0; D t , 6 = 


= i;P 


= vec(B) 




vec(u t ) 


= ft,u + D t , u u; f t ,„ = 


= 0; D t , u 


= (h t T 


<g> I p ); v = 


vec(C) 


vec(Q 4 ) 


= f t , g + D t , g q; f t , g = 


0; D t ,, 


= D g 






vec(Zt) 


= f t , z + D M C; ft,. = 


0; D t) , 


= i;C 


= vec(Z) 




vec(at) 


= f t , Q + D. ,a: ff, a = 


= 0; D t , a 


= (g t T 




vec(F) 


vec(Rt) 


= ft,r + D t , r r; ff, r = 


0; D t , r = 


= D r 






vec(A) 


= f A + D A A; f A = 










vec(|) 


= C = f c + D ? p; f ? = 


= 0; D e = 


= 1 







Note that variance-covariance matrices are never unconstrained really so we use D g , D r and Da to specify 
the symmetry within the matrix. 

The transformation of the simple MARSS with covariates (equation into the form of equation (|75|) 
may seem a little painful, but the advantage is that a single EM algorithm can be used for a large class of 
models. Presumably, the transformation of the equation will be hidden from users by a wrapper function that 
does the reformulation before passing the model to the general EM algorithm. In the MARSS R package, 
this reformultion is done in the MARSS .marxss function. 



4.3 A general MARSS model with exogenous variables 

Let's imagine now a very general MARSS model with various 'inputs'. ' input' here just means that it is 
some fully known matrix rather than something we are estimating. It could be a sequence of 0s and Is if 
for example we were fitting a before/after sort of model. Below the letters with a t subscript are the inputs, 
except x, y, w and v. 

x t = JtBLtXt-i + CjUgj + G t w t 
y t = MtZNtXt + Ft Ah t + H f v t 



In vec form, this is: 



x t = {xj_ x (g> I m )(Lj <g> J t ) vec(B) + (gj <8> C t ) vec(U) + G t w t 

= ® I m )(L t T g J t )(f 6 + D 6 /3) + (g 4 T ® Ct)(f u + T> u v) + G t w t 
W t ~ MVN(0, G t QGj) 



y t = (xj <g> I„)(N t T ® M t ) vec(Z) + (h t T ® F t ) vec(A) + H 4 v t (88) 

= (xj ® I„)Z t (f* + D,C) + A t (f a + D a a) + H 4 v t 
V t ~ MVN(0,H t RH7) 



X to ~ MVN(f s + D^p, FAF 1 ), where vec(A) = f A + D A A 

We could write down a likelihood function for this model but written this way, the model presumes that 
HfRH^, GfQG^, and FAF T are valid variance-covariance matrices. I will actually write this model 
differently below because I don't want to make that assumption. 
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We define the f and D parameters as follows. 



vec(B t ) 


= U,b 


+ D t) ^ - 


(l7 


® Jt)f b + 






vec(u t ) 


= ft,« 


+ T>t, u v = 


(g t T 


® C t )f„ 4 


-(gt T * 


5C t )D„u 


vec(Q t ) 


= ft, q 


+ D t)? q = 


(G t 


<8>G t )f,+ 


(G t « 


G t )D 9 q 


vec(Z 4 ) 


= f M 


+ D M C = 


(N t T 


<8>M t )f,- 


+■ (N t T 


®M t )D,C 


vec(a t ) 


= ft, a 


+ D t , a a = 


(h t T 


®F t )f a + 




)F t )D a a 


vec(R t ) 


= ft,r 


+ D t , r r = 


(H t 


8>H t )f,+ 


(H t ® 


H t )D r r 


vec(A) 


= f\-\ 


D A A = 


+ D 








vec(£) 


= £ = 


f e + d?p 


= 0- 


Hp 







Here, for example fb and Dh indicate the linear constraints on B and ft.b is (L t ®Jt)f& and Dt,6 is (L t ®Jt)D(,. 
The elements of B that are being estimated are /? arranged as a column vector. 

As usual, this reformulation looks cumbersome, but would be hidden from the user presumably. 

4.4 The expected log-likelihood function 

As mentioned above, we do not necessarily want to assume that GtRtG^, HtQ t Hj , and FAF T are valid 
variance-covariance matrices. This would rule out many MARSS models that we would like to fit. For 

T 

HQH 1 would be an invalid variance-variance matrix. However, this is a 



example, if Q = a and H = 



valid MARSS model. 

Instead I will define $ t = (H t r H t ) _1 H f r , E t = (G t r G t )~ 1 G t r , and n = (F T F)~ 1 F T . I then require that 
the inverses of G t Gt, H^Ht, and F T F exist and that ft, q + Df 9 q, ff. r 4- D t r r, and f \ + T>\\ specify valid 
variance-covariance matrices. These are much less stringent restrictions. 

For the purpose of writing down the expected log-likelihood, our MARSS model is now written 

$ t x t = $ t (x7_i ® I™) vec(Bt) + $t vec(ut) + w tj where W f ~ MVN(0, Q t ) 
E t y t = E t (xJ ®I„)vec(Z t ) +S t vec(a t ) +v t , where V t ~ MVN(0, R t ) (89) 
nx fo = II£ + 1, where L - MVN(0, A) 

As mentioned before, this relies on G and H having forms that do not lead to over- or under-constrained 
linear systems. 

To derive the EM update equations, we need the expected log-likelihood function for the time-varying 
MARSS model. Using equation (|89|). we get 



E XY [logL(F,X;e)] = -iE XY ^(F t - (Xj ® I m )vec(Z t ) - vec(a t )) T s7R f - 1 S f 

T 

(Y t - (Xj ® I m ) vec(Zt) - vec(at)) + ^ log |R t | 

i 

T 

+ 53 (JT t - (X t T -i ® I™) vec(Bt) - vec(ut)) T $ t T Q( -i$ t (90) 



«o + l 



(X t - {Xj_ x ®I m )vec(B t ) - vec(ut)) + log|Q t | 

to+l 

+ {X tQ - vcc($)) T n T A- 1 n(Af io - vec(C)) + log |A| + log27r 
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If any G t , H t or F is all zero, then the line in the likelihood with R f , Q t or A, respectively, does not appear. 
If any x to are fixed, meaning all zero row in F, that X t[) = £ anywhere it appears in the likelihood. The way 
I have written the general equation, some Xt might be fixed and others stochastic. 
The vec of the model parameters are defined as follows: 



vec(rit) 


c 

— *t,b 


+ i->t,bP 


v cty lit J 


— 1 t,u 


+ n< i) 


vec(Zt) 


*-T,Z 


+ D f X 


vec(a t ) 




+ D ti(1 a 


vec(Q t ) 


= h, q 


+ D t , 9 q 


vec{RR t ) 


= f tjr 


+ D t , r r 


vec(£) 


= U + 




vec(A) 


= fx4 


D A A 


$t 


= (G/ 




— ; 


= (H/ 


H t )- x H t T 


n 


= (F T 


F)- 1 F T 



5 The constrained update equations 

The derivation proceeds by taking the partial derivative of equation [90] with respect to the estimated terms, 
the £, ex, etc, setting the derivative to zero, and solving for those estimated terms. Conceptually, the algebraic 
steps in the derivation are similar to those in the unconstrained derivation. 



5.1 The general u update equations 

We take the derivative of \P (equation 150)) with respect to v. 

d^/dv = -\Y j {~ d(E[Xj® t -D t , u v})/dv - d(E[v T Bj u Q t X t ])/dv 
t=i ^ 

+ d(E[((Xj_ 1 ®I rn )vec(B t )) T Q t V t>u v})/dv + d(V[v T T>lMXj_ 1 ® I m ) vec(B t )])/^ ( 91 ) 
+ d(v T T>Z u <Q t r> t , u v)/dv + d(E[iJ u Q t B t . u v])/dv + d(E[v T nl u Q t f t>v ])/dv 

where Q t = $ t T Qr 1$ t- 

Since v is to the far left or right in each term, the derivative is simple using the derivative terms in table 
13.11 d^/dv becomes: 



d^/dv = ~J2( - 2E[X 4 T <Q t D f ,„] + 2E[((X i T _ 1 ® I m ) vec(B t )) T Q t D 4 , u ] 

(92) 



+ 2(u T D t ^Q t D t)U ) + 2E[f^Q t D t , u ] 
Set the left side to zero and transpose the whole equation. 

T 



= E n luQt E[X t ] - D t ^Q t (E[X t _i] T ® I m ) vec(B t ) - D^Q t D t ,„« - D^Q t f t ,„ (93) 
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Thus, 

T T 

(^D^Q t D t)U )u = J2 T >J,uQt(nX t ] - (E[X t _!] T ®I m )vec(B t ) - f t ,„) (94) 
t=i t=i 

We solve for v, and the new v for the j + 1 iteration of the EM algorithm is 

v j+1 = D^QjDt,^ J] D^Q t (x t - (x7_ x ® I m ) vec(B t ) - f t , u ) (95) 
^ t=i ' t=i 

The update equation requires that 2t=i •'^7-uQt-^ti« ^ s invertible. It generally will be if & t Qt®7 is a 
proper variance-covariance matrix (positive semi-definite) and D tjU is full rank. If G t has all-zero rows then 
Q t Qt®7 nas zeros on the diagonal and we have a partially deterministic model. In this case, Qt will have 
all-zero row/columns and D tu Q{Dj )t( will not be invertible unless the corresponding row of T)t,u is zero. 
This means that if one of the x rows is fully deterministic then the corresponding row of u would need to 
be fixed. We can get around this, however. See section [7] on the modifications to the update equation when 
some of the x's are fully deterministic. 

5.2 The general a update equation 

The derivation of the update equation for a with fixed and shared values is completely analogous to the 
derivation for v. We take the derivative of VP with respect to a and arrive at the analogous: 

T T 

a j+1 = ( ]T D^ a R t D t , a ) - 1 ]T T>I a Rt (y, - (xj ® I„) vec(Z t ) - f f , a ) 
t=i t=i 

T T 



(96) 



t=l t=l 



J2t=i D^ a IRtDt ia must be invertible. 

5.3 The general £ update equation, stochastic initial state 

When iEo is treated as stochastic with an unknown mean and known variance, the derivation of the update 
equation for £ with fixed and shared values is as follows. Take the derivative of ^ (using equation [M]) with 
respect to p: 

d^f/dp = (X^L - £ T L) (97) 
Replace £ with + D^p, set the left side to zero and transpose: 

= T>J (Lxo - Lf s + LD c p) (98) 

Thus, 

p J+1 - (DjLD^-'DjLCxo - f e ) (99) 

and the new £ is then, 

£ i+1 =f s + D e p i+1 , (100) 
When the initial state is defined as at t = 1, replace x with Xi in equation [Ml 
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5.4 The general £ update equation, fixed Xq 

For the case, Xq is treated as fixed, i.e. as another parameter, and A does not appear in the equation. It 
will be easier to work with <J/ written as follows: 

rp rp 

E XY [logL(F,X;e)] = -Je xy ( YJy t -Z t X t - at ) T R t (Y t - Z t X t - a t ) + ^log |R t | 



2 

v i i 

(101) 



J2(*t - B t X t _! - u t ) T Q t (X t - B t X t x - u t ) + log iQtl + log27r 



i 



x = + D c p 

This is the same as equation (|90j) except not written in vec form and A does not appear. Take the derivative 
of "J using equation (|101[) . Terms not involving p will drop out: 



(102) 



d^/dp = ~U- E[«9(P7Q 1 B 1 D 5 p)/(9p] - E[9(p T (B 1 D c ) T Q 1 P 1 )/a P ] 

+ E[a(p T (B 1 D 5 ) T Q 1 B 1 D cP )/9p] 
where 

Pi = Xx - Bif € - u x (103) 
After pulling the constants out of the expectations and taking the derivative, we arrive at: 

d*/dp = ~\( - 2E[P 1 ] T Q 1 B 1 D ? + 2p T (B 1 D,) T QiB 1 D^ (104) 

Set the left side to zero, and solve for p. 

p = (D^BjQiBiDeJ-^BjQ^xi - B^ - u x ) (105) 

This equation requires that the inverse right of the = exists and it might not if B t or Qi has any all zero 
rows/columns. In that case, defining £ = X\ might work (section 15. 5|) or the problematic rows of £ could be 
fixed. The new £ is then, 

^ + i=f? + D «P J+ i> (106) 
5.5 The general £ update equation, fixed x\ 

When X\ is treated as fixed, i.e. as another parameter, and A does not appear, the expected log likelihood, 
'J, is written as follows: 

E XY [logL(y,X;e)] = -iE XY (^(y t - Z t X t - at ) T R t (Y t - Z t X t -at) +^log|R 4 | 

^ l i 

v^, \ (107) 

+ J2( X t - B * X *-i - "t) T ®t(X t - B t X t x - ut) + lo S IQtl + lo S 27 

2 2 ' 

xi = f 5 + D ? p 

Take the derivative of *f? using equation (|107[) : 

d*/dp = -~(- E[9(07MiZiD s p)/9p] - E[<9((ZiD 5 p) T IRiOi)/dp] 

+ E[5((Z 1 D ? p) T R 1 Z 1 D ?P )/9p] - E[d(pTQ 2 B 2 D € p)/dp] - E[d((B 2 D c p) T Q 2 P 2 )/dp] (108) 
+ E[d((B 2 D 4 p) T Q 2 B 2 D 4 p)/dp] 
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where 

P 2 =X 2 - B 9 f £ - u 2 

4 (109) 

In terms of the Kalman smoother output the new £ for EM iteration j + 1 when £ = o;i is 

Pj . +1 - ((ZiD € ) T RiZiD f + (B 2 D 5 ) T Q 2 B 2 D c )- 1 ((Z 1 D 5 ) T M 1 6i + (B 2 D 5 ) T Q 2 P 2 ) (110) 

where 

P 2 = x 2 - B 2 f ? - u 2 

~ „ ( m ) 
Oi = yi - Zxf ? - ai 

The new £ is 

C J+1 =f e + D 4Pj+1 , (112) 
5.6 The general B update equation 

Take the derivative of ^ with respect to /3; terms in ^ do not involve /? will equal and drop out. 
= -i W - 9(E[X t T Q t T t D ti6/ 9])/^ - 5(E[(Y t D t . b/ 8) T Q f X f ])/^ 

+ 9( E[(Y t D M /J) T Q t T t D t , b /?])/c>/? + 9( E[u t T Q f T t Tt tjb p])/d0 + d((Y t D t , b /?) T Q t u t )/d/? ( 113 ) 
+ a(E[(T t f t , fc ) T Q t T t D t ,^])/^ + a(E[(T t D ti ^) T Q t T t f t , fc ])/5 J 8' 



where 

T t = (Xj_ 1 ®I m ) (114) 

Since /3 is to the far left or right in each term, the derivative is simple using the derivative terms in table 
E3J becomes: 



d^/dv = ~J2(-2 nxjQ t T t T> t , b ] + 2(/3 T D t T b T7Q t T t D t , fc ) 

(115) 



t=i 



+ 2 E[u7Q t Y t D t , b ] + 2 E[( Y t f t , b ) T Q t Y t D M ] 

Note that X appears in Y 4 but not in other terms. We need to keep track of where X appears so the we 
keep the expectation brackets around any terms involving X. 



d^/dp = ( E[Xj® t Y t ]D t , b - ujQt E[T t ]D ti6 - ft T Dj b E[Y 4 T Q t T t ]D t , 6 - f J >b E[Yj® t Y t ]D tib 
t=i ^ 



(116) 

Set the left side to zero and transpose the whole equation. 

= E ( B lb n?I®tX t } - T>l b E[T t ] T Q t u t - T>l b E[TjQ t r t }f tib ~ Bj tb E[T t T Q t T t ]D t)W fA (117) 

t=i ^ ' 

Thus, 

T T 

( Vlb E[Y t T Q t Y t ]D M )/? = D£ 6 ( E[TjQ t X t ] - E[Y t ] T Q t u 4 - E[Y t T Q t Y t ]f i)6 ) (118) 



t=i t=i 
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Now we need to deal with the expectations. 



E[rjQ t r t ] = e[(jt7_! ® i m ) T Q t (xU ® i m )] 

= E[(X t _x g) I ro )Q i (x7_ 1 ® I m )] 

= E\X t - 1 Xj_ 1 ®Q t ] (119) 

= E[X t _iJT t Li] ® Qt 
= P t _!®Q t 

E[T t T Q t Jf 4 ] - E[(jr7_i ®I m ) T Q t X t ] 
= E[(X t _i ® I m )Q t X t ] 

= E[(JTt_i ® QtJJTt] (120) 
= E[vcc(Q t X t A:7_ 1 )] 
= vec(Q t P M _i) 

E[T t ] T Q t u t = (E[X t _i] ® I m )Q t u t 

= (x t _i «) Q t )u t (121) 
= vec(Q t u t xJ l L 1 ) 



Thus, 

T 



[YPlb(Pt-i®Ot)T>t,b)0 =5^D^(vec(Q t P M _i) - (P*-i O Qt)f*,6 - vec(Q t u t x t T „ 1 )) (122) 



Then j3 for the j + 1 iteration of the EM algorithm is then: 

T x -1 T 



P= (53D^(P t _i®Qt)D t|6 ) x53D^(vec(QtP tjt _i)-(Pt_i®Qt)ft |6 - vec(Q t u t x7_i)) (123) 
^ t=i ' t=i 

This requires that D t b (P t _i ® Qf)D t) & is invertible, and as usual we will run into trouble if § t Qt®7 nas 
zeros on the diagonal. See section [7J 

5.7 The general Z update equation 

The derivation of the update equation for £ with fixed and shared values is analogous to the derivation for 
ft. The update equation for £ is 

T T 

O+i = (^D^j ®Kt)D t ,,)0 x ^D^ 2 (vec(R t yx t ) - (P t ®R t )f M - vec(R t a t x t T )) (124) 
t=i t=i 

This requires that T>J z (Pt ® R()Dt » is invertible. If SiR t S^ has zeros on the diagonal, this will not be 
the case. See section [7J 

5.8 The general Q update equation 

A general analytical solution for Q is problematic because the inverse of Q t appears in the likelihood and Q^ 1 
cannot always be rewritten as a function of vec(Q t ). However, in a few important special — yet quite broad — 
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cases, an analytical solution can be derived. The most general of these special cases is a block-symmetric 
matrix with optional independent fixed blocks (subsection 15.8.5( 1. Indeed, all other cases (diagonal, block- 
diagonal, unconstrained, equal variance-covariance) except one (a replicated block-diagonal) are special cases 
of the blocked matrix with optional independent fixed blocks. 

Unlike the other parameters, I need to put constraints on f and D. I constrain D to be a design matrix. 
It has only Is and Os, and the rows sums are either 1 or 0. Thus terms like q\ + qi are not allowed. A 
non-zero value in f is only allowed if the corresponding row in D is all zero. Thus elements like /i + q\ are 
not allowed in Q. These constraints, especially the constraint that D only has 0s and Is, might be loosened, 
but with the addition of G tl we still have a very wide class of Q matrices. 

The general update equation for Q with these constraints is 

T T 

q J+1 = (^(D^D M ))" 1 ^D^vec(55 t ) 

i=l t=l 

where SS t = $ t (P ( - P t ,t-iBj - B t P t _ M - x t uj - u t xj+ 

B t P t ^Bj + BtStt^uJ + utxJ^Bj + u t uj)<S>J ( 125 ) 
Qt = ft,«+I> M q 
where 

* t = (G7G t )" 1 G7 

The vec of Q t is written in the form of vec(Q t ) = t t , q + Dt, ? <7, where t t , q is a p 2 x 1 column vector of 
the fixed values including zero, T)t,q is the p 2 x s design matrix, and q is a column vector of the s free values 
in Q f . This requires that (T)J T)t,q) be invertible, which in a valid model must be true; if is not true you 
have specified an invalid variance-covariance structure since the implied variance-covariance matrix will not 
be full-rank and not invertible and thus an invalid variance-covariance matrix. 

Below I show how the Q update equation arises by working through a few of the special cases. In these 
derivations the q subscript is left off the D and f matrices. 



5.8.1 Special case: diagonal Q matrix (with shared or unique parameters) 

Let Q be a non-time varying diagonal matrix with fixed and shared values such that it takes a form like so: 



Q = 



01 

















h 

















02 

















h 

















0.2 



Here, /'s are fixed values (constants) and g's are free parameters elements. The / and q do not occur 
together; i.e. there are no terms like /i + q\. 

The vec of Q _1 can be written then as vec(Q _1 ) = f * + D q q*, where f* is like f 9 but with the corre- 
sponding i-th non-zero fixed values replaced by 1/fi and q* is a column vector of 1 over the qi values. For 
the example above, 



Take the partial derivative of \1/ with respect to q*. We can do this because Q 1 is diagonal and thus 
each element of q* is independent of the other elements; otherwise we would not necessarily be able to vary 
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one element of q* while holding the other elements constant. 

T 

2 

t=i 

\TjiTn-l* v l uivTj.Tn-1, 



- E[(B f X f _x) 1 $ t ' Q-^tXt] - E[X t ' $ t ' Q-^tUt] 

- EtuJ^Q- 1 ^] + E[(B t X f _ 1 ) T $7Q- 1 $ t B f X t _ 1 ] (126) 

+ E[(B i X i _ 1 ) T $ i T Q- 1 $ i u t ]+E[u7$ t T Q- 1 $ t B i X t _ 1 ] + u7$7Q- 1 $ t u^/ag* 
-a(|log|Q|)/9g* 

Using the same vec operations as in the derivations for B and Z, pull Q 1 out from the middle and replace 
the expectations with the Kalman smoother outputP^I 



T 

8-f/dq* = ~J2d(E[Xj ®Xj] - E[Xj ® (B t X t _x) T ] - E[(B t X t ^) T ® Xj] 



T 

2 

E[Xj ® u 4 T ] - E[u t T ®Xj] + E[(B i X t _ 1 ) T ® (B t X t _!) T ] 

E[(B t X t _!) T ® u 4 T ] + E[uJ ® (BI W ) T ] + (uj ® uj)) ($ t ® $ 4 ) T vecCQ- 1 )/^* 



a(|log|Q|)/9g* (127) 



2 

7' 



i J] vec(55 t ) T 9( vec(Q- 1 ))/ag* + fl£ log IQ" 1 !)/^ 



2 

t=l 

where 

55 t = * t (P t - P t , t _iB7 - BP t _i,t - x 4 u7 - u t x7 + 
B t P t _!B t T + B t x t _ lU 7 + utx7_!B t T + u t u t T )$7 
This reduction used 

($ t ®$t)(X®jr) = vec(XX T ) = vec($ t vec(XX T )$7)- 

I also replaced log |Q| with — log |Q _1 |; the determinant of a diagonal matrix is the product of its diagonal 
elements. Thus, 

T 

, 2 

r 

2 



dV/dq* = - ( vec(^ t ) T (f* +D q q*) 

(128) 

\ E(log(/r) + log(/ 2 *)...fclog(<£) + llog(q* 2 )...)) /dq* 



where A: is the number of times qi appears on the diagonal of Q and I is the number of times 92 appears, 
etc. Taking the derivatives, 



T T 



OV/dq* == i^DjvecOSSt) -^W/') + -fclogfe*) + Hogfe*)...)/^ 

4=1 t=l 
T T 

= \ E D ^ vec ^ 5 *) - \ E D ^ 



t=l t=i 



13 Another, more common, way to do this is to use a "trace trick", trace(a T Ab) = trace(Aba T ), to pull Q 1 out. 
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D f; D 9 is a s x s matrix with k, I, etc. along the diagonal and thus is invertible; as usual, s is the number 
of free elements in Q. Set the left side to zero (a 1 x s matrix of zeros) and solve for q. This gives us the 
update equation for q and Q: 



%+i = ( E D J D «) 1 E D ^ vec (^*) 
t—i t—\ 

vec(Q)j+i = f + T> q q j+1 
Since in this example, T) q is time-constant, this reduces to 

T 

T> T n.. , r 1 n T ' 

SSt is defined in equation (|127p . 



(130) 



1 T 
q j+1 = -(DjD g )- x D^ vec(SS t ) 



1=1 



5.8.2 Special case: Q with one variance and one covariance 





a 


P 


P 


P~ 




~f(a,P) 


g(a,P) 


g(a,p) 


g{a,P) 


Q = 


P 


a 


P 


P 


Q 1 = 


g(u,P) 


f(a,P) 


g(a,p) 


g{a,P) 


P 


P 


a 


P 


g(a,p) 


g(a,(3) 


f{a,P) 


g(a,p) 




P 


P 


P 


a 




9(a,P) 


g(a,(3) 


g(a,P) 


f{*,p) 



This is a matrix with a single shared variance parameter on the diagonal and a single shared covariance on the 
off-diagonals. The derivation is the same as for the diagonal case, until the step involving the differentiation 
ofloglQ" 1 !: 



T 

d^/dq* = d( - \ J2 ( vec (^) T ) vecCQ- 1 ) + | log |Q" 



/Oq* 



(131) 



It does not make sense to take the partial derivative of log |Q _1 | with respect to vec(Q _1 ) because many 
elements of Q _1 are shared so it is not possible to fix one element while varying another. Instead, we can 
take the partial derivative of log |Q _1 | with respect to g(a,/3) which is j}Gsot g \Q~ 1 \/9q*ij- Set g 
is those i, j values where q* — g(a, (3). Because g() and /() are different functions of both a and (3, we can 
hold one constant while taking the partial derivative with respect to the other (well, presuming there exists 
some combination of a and j3 that would allow that). But if we have fixed values on the off-diagonal, this 
would not be possible. In this case (see below), we cannot hold g() constant while varying /() because both 
are only functions of a: 





a 


f 


f 


f 




/(") 5(a) 


5(a) 


5(a) 


Q = 


f 
f 


a 
f 


f 

a 


f 
f 


Q 1 = 


5(a) /(a) 
5(a) 5(a) 


5(a) 
/(a) 


5(a) 
5(a) 




f 


f 


f 


a 




5(a) 5(a) 


5(a) 


/(a) 



Taking the partial derivative of log |Q 
as for the diagonal matrix: 



with respect to q* = [ ^["'oj ] , we arrive at the same equation 



T T 

d^/dq* = \ D T vec(SS t ) - \ E(D T D)<Z 



(132) 



t=i 



where here D T D is a 2 x 2 diagonal matrix with the number of times /(«,/?) appears in element (1, 1) and 
the number of times g{a, f3) appears in element (2, 2) of D; s = 2 here since there are only 2 free parameters 
in Q. 

Setting to zero and solving for q* leads to the exact same update equation as for the diagonal Q, namely 
equation (| 130[) in which f q = since there are no fixed values. 
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5.8.3 Special case: a block-diagonal matrices with replicated blocks 

Because these operations extend directly to block-diagonal matrices, all results for individual matrix types 
can be extended to a block-diagonal matrix with those types: 



Q 





B 2 
B 3 



where Bj is a matrix from any of the allowed matrix types, such as unconstrained, diagonal (with fixed or 
shared elements), or equal variance-covariance. Blocks can also be shared: 



Q 



o 







i); one cannot simply share individual elements in different 



but the entire block must be identical (B2 ; 
blocks. Either all the elements in two (or 3, or 4...) blocks are shared or none are shared. 
This is ok: 

c d d 

d c d 

d d c 

c d d 

d c d 

d d c 



This is not ok: 



c d d 

d c d 

d d c 













c d 

d c 



nor 



c 


d 


d 











d 


c 


d 











d 


d 


c 




















c 


e 


e 











e 


c 


e 











e 


e 


c 



The first is bad because the blocks are not identical; they need the same dimensions as well as the same 
values. The second is bad because again the blocks are not identical; all values must be the same. 



5.8.4 Special case: a symmetric blocked matrix 

The same derivation translates immediately to blocked symmetric Q matrices with the following form: 



Q 



JE-l 
Cl, 2 
Cl, 3 



E 



2 
2,3 



*-l,3 
C 2 , 3 

E 3 



where the E are as above matrices with one value on the diagonal and another on the off-diagonals (no 
zeros!). The C matrices have only one free value or are all zero. Some C matrices can be zero while are 
others are non-zero, but a individual C matrix cannot have a combination of free values and zero values; 
they have to be one or the other. Also the whole matrix must stay block symmetric. Additionally, there can 
be shared E or C matrices but the whole matrix needs to stay block-symmetric. Here are the forms that E 
and C can take: 



E, 



P 
a 
P 
P 



P 
p 
a 
P 



d = 



X 


X 


X 


X 




"0 











X 


X 


X 


X 


or 














X 


X 


X 


X 














X 


X 


X 


x. 
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The following are block-symmetric: 



Ei 


Cl.2 


Ci, 3 




E 


C C 


"1,2 


E 2 


C 2 , 3 


and 


C 


E C 


"1,3 


C 2 , 3 


E 3 _ 




C 


C E 






"Ei 


d C 1;2 " 






and 


Ci 


Ei 


-1,2 








Cl,2 


Ci,2 


E 2 _ 





The following are NOT block-symmetric: 





Cl,2 







Ei 




Ci 






Ei 







Cl. 2 


"1,2 


E 2 


C 2 . 3 


and 





Ei C 2 


and 







Ei 




Cl,2 





C 2 ,3 


E 3 _ 




Ci 


C 2 E 2 _ 






Cl,2 


Ci, 


2 


E 2 






"Ui 


Cl,2 


Ci.3 






" Di 


( 


'1,2 


Ci, 3 " 








and 


Cl. 2 


E 2 


C 2 . 3 




and 


Cl,2 


E 2 


C 2 , 3 










Ci.3 


C 2 , 3 


E 3 






Cl, 3 


C 2 , 3 


E 3 







In the first row, the matrices have fixed values (zeros) and free values (covariances) on the same off-diagonal 
row and column. That is not allowed. If there is a zero on a row or column, all other terms on the off-diagonal 
row and column must be also zero. In the second row, the matrix is not block-symmetric since the upper 
corner is an unconstrained block (Ui) in the left matrix and diagonal block (Pi) in the right matrix instead 
of a equal variance- covariance matrix (E) . 



5.8.5 The general case: a block-diagonal matrix with general blocks 



In it's most general form, Q is allowed to have a block-diagonal form where the blocks, here called G are 
any of the previous allowed cases. No shared values across G's; shared values are allowed within G's. 



Q 



Gi 







G 2 
G 3 



The G's must be one of the special cases listed above: unconstrained, diagonal (with fixed or shared values), 
equal variance-covariance, block diagonal (with shared or unshared blocks), and block-symmetric (with 
shared or unshared blocks). Fixed blocks are allowed, but then the covariances with the free blocks must be 
zero: 



Q 



Fixed blocks must have only fixed values (zero is a fixed value) but the fixed values can be different from 
each other. The free blocks must have only free values (zero is not a free value). 



F 














Gi 














G 2 














G 3 



5.9 The general R update equation 

The R update equation for blocked symmetric matrices with optional independent fixed blocks is completely 
analogous to the Q equation. Thus if R has the form 



R = 



F 














Gi 














G 2 














G 3 
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Again the G's must be one of the special cases listed above: unconstrained, diagonal (with fixed or shared 
values), equal variance- covariance, block diagonal (with shared or unshared blocks), and block-symmetric 
(with shared or unshared blocks). Fixed blocks are allowed, but then the covariances with the free blocks 
must be zero. Elements like /j + Vj and r» + Tj are not allowed in R. Only elements of the form /, and r% 
are allowed. If an element has a fixed component, it must be completely fixed. Each element in R can have 
only one of the elements in r, but multiple elements in R can have the same r element. 
The update equation is 

^ t=\ ' \ t=l 

vec(R) i)J - + i = f t ,r + D tjr rj + i 

The R t j + i used at time step t in equation Q133[) is the term that appears in the summation in the uncon- 
strained update equation with no missing values (equation I53|) : 

R M+1 = E t (b t - fk t Z] - Z t fk T t - y t a7 - a t y t T + Z t P t Z t T + Z t x t a7 + ai t xjzj + a t a^ Ej (134) 
where E t = (Hjllt)^ 1 !!^ . 



(133) 



6 Computing the expectations in the update equations 

For the update equations, we need to compute the expectations of X t and Y t and their products conditioned 
on 1) the observed data 1^(1) = y(l) and 2) the parameters at time t, Qj. This section shows how to compute 
these expectations. Throughout the section, I will normally leave off the conditional 1^(1) = y(l),®j when 
specifying an expectation. Thus any E[] appearing without its conditional is conditioned onF(l) = y(l), Qj. 
However if there are additional or different conditions those will be shown. Also all expectations are over 
the joint distribution of XY unless explicitly specified otherwise. 

Before commencing, we need some notation for the observed and unobserved elements of the data. The 
n x 1 vector y t denotes the potential observations at time t. If some elements of y t are missing, that means 
some elements are equal to NA (or some other missing values marker): 



yi 

NA 
2/3 

2/4 

NA 

ye 



(135) 



We denote the non-missing observations as y t (l) and the missing observations as 2/ t (2). Similar to y t , Y t 
denotes all the Y random variables at time t. The Y t 's with an observation are Y t (l) and those without an 
observation are denoted Y t {2). 

Let flj be the matrix that extracts only Y t (l) from Y t and Jl t (2) be the matrix that extracts only 
Y t (2). For the example above, 



Y t (i) = n^Y t , n\ 



Y t {2) = nf'Yt, n[ 



1 

1 

1 

1 

1 

1 



(136) 
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We will define another set of matrices that zeros out the missing or non-missing values. Let Ij denote 
a diagonal matrix that zeros out the Y t (2) in Y t and I t denote a matrix that zeros out the Y t (l) in Y t . 
For the example above, 



i; 



(2) - (^ 2) ) T ^ 2) 



1 



1 

1 



1 

0' 

1 





1 





and 



(137) 



6.1 Expectations involving only X t 

The Kalman smoother provides the expectations involving only X t conditioned on all the data from time 1 
to T. 



it = E[X t ] 

V t = vav[X t ] 

V t ,t-i = cav\X t ,X t -i] 

From x t , Vt, and V fj t_i, we compute 

P t = E[X t Xj] = V t +x t xJ 



(138a) 
(138b) 
(138c) 

(138d) 
(138e) 



The Pt and P^t-i equations arise from the computational formula for variance (equation I12[) . Note the 
smoother is different than the Kalman filter as the filter does not provide the expectations of X t conditioned 
on all the data (time 1 to T) but only on the data up to time t. 

The first part of the Kalman smoother algorithm is the Kalman filter which gives the expectation at time 
t conditioned on the data up to time t. The following the filter as shown in (jShumwav and Stofferl . 120061 
sec. 6.2, p. 331), although the notation is a little different. 



V*- 1 = B t V t t Z 1 1 Bj + G t Q t G] 
A =x\~ 1 +Kt(y t -Zt**" 1 -a«) 



t-i 



V* = (I m - K t Z t )V 
K t = V^Z^ZtV^Z^ + HtRtH^)- 1 



(139a) 
(139b) 
(139c) 
(139d) 
(139e) 



The Kalman smoother and lag-1 covariance smoother compute the expectations conditioned on all the 
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data, 1 to T: 



H-x 



= x 



1 

t-i 



+ Jt- 



v^v^ + j^v^-vj- 1 )^ 



v T.T-l 

v t-l,t-2 — v t-l J i-2 



(I - K T Z T )B T V^:^ 

Jt-iCCVTt.t-BtV*!!))^ 



(140a) 
(140b) 
(140c) 
(140d) 
(140e) 
(140f) 



The classic Kalman smoother is an algorithm to compute these expectations conditioned on no missing 
values in y. However, the algorith m can be easily modified to g ive the expected values of X conditioned on 
the incomplete data, 1^(1) = y(l) (jShumwav and Stofferl , I2006L sec. 6.4, eqn 6.78, p. 348). In this case, the 
usual filter and smoother equations are used with the following modifications to the parameters and data 
used in the equations. If the i-th element of y t is missing, zero out the i-th rows in y t , a and Z. Thus if the 
2nd and 5th elements of y t are missing, 



(141) 



The R t parameter used in the filter equations is also modified. We need to zero out the covariances 
between the non-missing, y t (l), and missing, y t (2), data. For the example above, if 



yi 




ai 




Zl,l 


Zl,2 


















2/3 


> a t = 


«3 


, Z* = 


23,1 


23,2 


2/4 




0,4 




Z4,l 


24,2 






















_a e _ 




.26,1 


^6,2 





ri,2 


ri,3 


ri 


4 


r l,5 


n,6 


r 2,l 


f2,2 


^2,3 




4 


r 2,5 




r 3,l 


?"3,2 


^3,3 




4 


?~3,5 


^3,6 


r 4,l 


?"4,2 


^4,3 


T4 


4 


?"4,5 


Tifi 


r 5,l 


r 5,2 


^5,3 




4 


?"5,5 


^5,6 


r 6,l 


^6,2 


^6,3 


ra 


4 


?~6,5 


^6,6 
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then the R/ we use at time t. will have zero covariances between the non-missing elements 1,3.4,6 and the 
missing elements 2,5: 








n.3 


ri,4 





n,6 





^2,2 








r 2,5 





r 3,l 





^3,3 


r 3,4 





^3,6 


r 4,l 





r 4,3 


»*4,4 





^4,6 





r 5,2 








7*5,5 





r 6,l 





^6,3 


re,4 





re,6 



Thus, the data and parameters used in the filter and smoother equations are 



(143) 



H hit 



a t * = I«a t 
Z t *=I«Z t 



(144) 



Rf 



IP)RJP) 



aj 1 , Zj and R^ only are used in the Kalman filter and smoother. They are not used in the EM update 
equations. However when coding the algorithm, it is convenient to replace the NAs (or whatever the missing 
values placeholder is) in y t with zero so that there is not a problem with NAs appearing in the computations. 
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6.2 Expectations involving Y t 

First, replace the missing values in y t with zeros3 and then the expectations are given by the following 



equations. The derivations for these equations are given in the subsections to follow. 

y t = E\Y t ] =y t - V t (y t - Z f x f - a t ) (145a) 

6 t = E[Y t Yj] = if 5 (V 4 H t R t H t T + V t Z t V t Z t T v7)lf 5 + y t yj (145b) 

fk t = E\Y t Xj] = V t Z 4 V t + y t xj (145c) 

yx M _ x = ElYtXj^} - VtZtV t) t-i + (145d) 

where V t = I - H t R t H t T (0| 1) ) T (n[ 1) H t R t H7(ni 1) ) T ) _1 n| 1) (145c) 

andl^ = (^ (2) ) T ^ 2) (145f) 



If y t is all missing, fl^ 1 ' is a x n matrix, and we define (fi| 1 ^) T (f2( 1 ' l R(Jlj 1 ' l ) T ) to be a n X n matrix 

of zeros. If R t is diagonal, then R t (ft t (1) ) T {n[ 1) R. t (n ( t 1) ) T )- 1 n ( t 1) = and V t = l[ 2) . This will mean that 
in y t the y t (2) are given by Z t x t + a t , as expected when y t (l) and y t (2) are independent. 

If there are zeros on the diagonal of Rt (section [7|) , the definition of Vt is changed slightly from that 
shown in equation 11451 Let 15^ be the matrix that extracts the elements of y t where y t (i) is not missing 
AND H t R t (?,i)H t T is not zero. Then 

Vt = I — H t R t H7(^ r) ) T (^ r) H t R t H t T (^ r) ) T )- 1 ^ r) (146) 



6.3 Derivation of the expected value of Y t 



In the MARSS equation, the observation errors are denoted H t v t . v t is a specific realization from a random 
variable Vf that is distributed multivariate normal with mean and variance Rt . V t is not to be confused 
with V t in equation 11381 which is unrelated^! to Vf. If there are no missing values, then we condition on 
Y t = y t and 



E\Y t \Y(l)=y(l)} = E\Y t \Y t =y t ]=y t 
If there are no observed values, then 

E[Y t \Y(l) = y(l)] = E[Y t ] = E[Z t X t +a t + V t ) = Z t X t + a t 



(147) 



(148) 



If only some of the Y t are observed, then we use the conditional probability for a multivariate normal 
distribution (here shown for a bivariate case): 



~Y{ 


- MVN ( 


Mi 






) 








S21 S22 





Then, 



If, 



(Yi|Fi =yi) =yi, and 

(Y 2 \Yi = yi) ~ MVN(/1, E), where 
p, = ^2 + S 2 i [yi - Mi) 
£ = S22 — S2iE 11 1 Si2 



(149) 



(150) 



14 The only reason is so that in your computer code, if you use NA or NaN as the missing value marker, NA-NA=0 and 
0*NA=0 rather than NA. 

15 1 apologize for the confusing notation, but Vt and vt are somewhat standard in the MARSS literature and it is standard 
to use a capital letter to refer to a random variable. Thus Vt would be the standard way to refer to the random variable 
associated with vt. 
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From this property, we can write down the distribution of Y t conditioned on Y t (l) = j/ t (1) and X t = Xt'- 



Y t (l)\X t = x t 
Y t (2)\X t =Xt 



mvn( 



n^iZtXt+at) 



(H t R t H7)n (H(R t H7)i2 
(H t R t H t )2i (H(R t H t )22. 



(151) 



Thus, 



(T t (l)|y t (l) = » t (l), JT t = s t ) = njSt and 
(F i (2)|F t (l) - y t (l), JT t = x t ) ~ MVN(/i, S) where 

/i = n t (2) (Z t x t + a 4 ) + R t> 2i(Rt,n) _1 nf - Z t x t - a t ) 

S = Rt,22 — Rt,2l(Rt,u) 1 Rt,12 

R f = HfRfH^ 

Note that since we are conditioning on X t = Xt, we can replace Y by Y t in the conditional: 
E\Y t \Y(l)=y(l),X t =x t ] = E\Y t \Y t (l)=y t (l),X t =x t }. 
From this and the distributions in equation (I152[) . we can write down y t = E[Y"t|Y"(l) 

y t =E XY \Y t \Y(l)=y(l)} 



(152) 



2/(1) ,Qj 



/ / Vtf{yt\yt{ l ),xt)dy t f{xt)dx t 
Jx t Jy t 

: E x [E y]:c [F t |F t (l) = 2 / t (l),X t =a: t ]] 
: E x \y t -V t (y t -Z t X t -EL t )] 
Vt ~ Vt(|/ 4 -Z t x t -a 4 ) 



(153) 



where V* = I - R t (nj 1) ) T (R tiU )- l n 



rioW 



(^l| 1 ' ) ) T (R ^I ll) _1 ^2t i • , isanxn matrix with Os in the non-(ll) positions. If the fc-th element of y t is observed, 
then k-th row and column of Vt will be zero. Thus if there are no missing values at time t, Vt = I — I = 0. 
If there are no observed values at time t, V t will reduce to I. 

6.4 Derivation of the expected value of Y,Y} 

The following outlines J^l derivation. If there are no missing values, then we condition on Y t = y t and 

E\Y t Yj\Y(l)=y(l)} = E[Y t Yj\Y t =y t ] 

= y t yJ- 

If there are no observed values at time t, then 
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= var[Z t X t + a t + H t V t ] + E[Z t X t + a t + H t Vt] E[Z t X t + a t + H t V f ] T 

= var[Vf] + var[ZtX t ] + (E[Z t X t + a t ] + E[H t V t ])( E[Z t X t + a t ] + E[HfV f ]) T 

= Rt + ZtV t z7 + (Z t xt + a t )(Z t x t + a t ) T 



(155) 



16 The following derivations are painfully ugly, but appear to work. There are surely more elegant ways to do this; at least, 
there must be more elegant notations. 
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When only some of the Y t are observed, we use again the conditional probability of a multivariate normal 
(equation 1 149p . From this property, we know that 

va,T Y]x \Y t {2)Y t (2) T \Y t (l) =y t (l),X t = x t ] = Rt, 22 - R t ,2i(Rt,n) _1 Rt,i2, 
vai Y]x \Y t (l)\Y t (l) = y t (l),X t =x t ] = 
and cov Y \ w \r t (l),Y t (2)\Yt(l) =y t (l),X t = x t ] =0 



Thus vax Ylx [Y t \Y t (l)=y t (l),X t =x t ] 

= (^t ) T (R-t,22 — Rt,2i(R{,ii) 1 R«, 12)^4 

,(2) 



(156) 



= lf(R t -R 4 ( n «)^(R 4>11 )- 1 ^ 1) R i )lf 
= l| 2) V t R t l| 2) 

The IJ bracketing both sides is zero-ing out the rows and columns corresponding to the y t (l) values. 

Now we can compute the E^y [KtY" t |Y"(1) = 2/(1)]. The subscripts are added to the E to emphasize 
that we are breaking the multivariate expectation into an inner and outer expectation. 

6 t = E XY [Y t Yj\Y(l) = y(l)} = E x [E Y{x [Y t Yj\Y t (l) = y t (l),X t = x t }] 
= E x [var Y]x [Y t \Y t (l)=y t (l),X t =x t ] 

+ E Ylx \Y t \Y t (l) =y t (l),X t =x t ]E Ylx \Y t \Y t (l) =y t (l),X t = x t ] T ] 
= E^lfV^I^] + E x [(y t - V t (y t - Z t X t - a,))(y t - V t (y t - Z t X t - a t )) T ] (157) 
= I< 2) V t RtI< 2) + var x [y t - V t {y t - Z t X t - a*)] 

+ E x [y t - V t (y t - Z t X t - a t )] E x [y t - V t (y t - Z f X f - a t )] T 

= ifV t R t i[ 2 > + ifV t z t v,z7v7ii 2) + y t y7 



Thus, 



O t = if 5 (V*Rt + VtZtVtZj V t T )I t (2) + y t y t T (158) 



6.5 Derivation of the expected value of Y,X, 

If there are no missing values, then wc condition on Y t — y t and 

E[Y t Xj \Y(l) = = j/ t E[X t T ] = y t xj (159) 

If there are no observed values at time t, then 

E[Y t Xj\Y(l)=y(l)} 

= E[{Z t X t + at + V t )Xj] 

= E[Z t X t Xj + a t Xj + V t Xj] (160) 
= Z t P t + at*7 + cov[V t ,X t ] + E[V t ] E[X t ] T 
= Z t P t + a t x7 

Note that V t and X t are independent (equation [TJ. E[V f ] = and cov[V t ,X t ] = 0. 
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Now we can compute the E XY [Y t X t \Y(1) — y(l)]. 

ykt = E XY [Y t Xj\Y(l)=y(l)} 

= cov\Y u X t \Y t (l) =y t (l)} + E XY [Y t \Y(l) = y(l)]E XY [Xj \Y(1) =y(l)] T 
= cov[j/ t - V t {y t ~ z tX t - a t ) + Vj ,X t ] + y t x t T 

= cov[y t ,X t ] - cov[V t y t ,X t } + cov[V t Z t X t ,X t } + cov[V t a t ,X t ] (161) 

+ cov[V*,X t Ky t X7 
= - + V t Z t V f + + + y t SiJ 

= V t Z t V f +y t x7 

This uses the computational formula for covariance: E[KX T ] = cov\Y,X] + Fj[Y]E[X] t . VJ" is a ran- 
dom variable with mean and variance — R.t.2i(R-t,n) _1 R't,i2 from equation (|152|) . Vj and X t are 
independent of each other, thus cov[V^,Xj] = 0. 

6.6 Derivation of the expected value of YfXj^ 

The derivation of ~E\YtXj_ 1 ] is similar to the derivation of Ep^X^]: 

y5c t = E XY \Y t Xj_ 1 \Y(l)=y(l)} 

= cav\y t ,Xt-i\Y t (l) = y t (l)} + E XY [Y t \Y(l) = y(l)] Exy\XJ-i\Y(1) = y(l)] T 
= cov[y t - Vt(y t - Z t X t - at) + VJ, JT t _i] + y t X t T _i 

= cov[2/ ( ,X t _i] - cov[V t 2/ t ,X t _i] + cov[V t Z t X t ,X t _i] (162) 

+ cov[V t a t ,X 4 _i] + cov[Vj,X t _i] +y t xj_ 1 
= - + VtZtVt,*-! + + + y f x7_i 
= VtZ t Vt >t _i+y t x t T _i 

7 Degenerate variance models 

It is possible that the model has deterministic and probabilistic elements; mathematically this means that 
either Gt, Hi or F have all zero rows, and this means that some of the observation or state processes are 
deterministic. Such models often arise when a MAR-p is put into MARSS-1 form. Assuming the model is 
solvable (one solution and not over-determined), we can modify the Kalman smoother and EM algorithm to 
handle models with deterministic elements. 

The motivation behind the degenerate variance modification is that we want to use one set of EM update 
equations for all models in the MARSS class — regardless of whether they are partially or fully degenerate. 
The difficulties arise in getting the u and £ update equations. If we were to fix these or make £ stochastic 
(a fixed mean and fixed variance), most of the trouble in this section could be avoided. However, fixing £ or 
making it stochastic is putting a prior on it and placing a prior on the variance- covariance structure of £ that 
conflicts logically with the model is often both unavoidable (since the correct variance-covariance structure 
depends on the parameters you are trying to estimate) and disasterous to one's estimation although the 
problem is often difficult to detect especially with long time series. Many papers have commented on this 
subtle problem. So, we want to be able to estimate £ so we do not have to specify A (because we remove 
it from the model). Note that in a univariate x model (one state), A is just a variance so we do not run 
into this trouble. The problems arise when x is multivariate (^1 state) and then we have to deal with the 
variance-covariance structure of the initial states. 
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7.1 Rewriting the state and observation models for degenerate variance systems 

Let's start with an example: 



and H* 



1 

1 



(163) 



Let fl^ r be a p x n matrix that extracts the p non-zero rows from H t . The diagonal matrix (fi^ r ) T fl^ r = 
is a diagonal matrix that can zero out the H t zero rows in any n row matrix. 



fit 



1 
1 



(n+) T n+ 



1 

1 



(164) 



vt = ntrVt 



Let fl[°} bea(n-p)xn matrix that extracts the n — p zero rows from H t . For the example above, 



n<°> 



[0 1 0] 



Vt — iL t,rVt 



li 0) = (fi( 0) ) T ^ 0) 

*i,r \ t,r I u "i,r 




1 




(165) 



Similarly, Slf extracts the non-zero rows from G t and fl[° q extracts the zero rows. l£ and are defined 



similarly. 

Using these definitions, we can rewrite the state process part of the MARSS model by separating out the 
deterministic parts: 



4 0) 



ni 0) JB t x t 



t,q \,*->t-"t-i -r Ut) 
x+ = n+ q x t = Sl+JBtXt-i +u t + G t w t ) 



Xo 



MVN(0, Q t ) 
MVN(£,A) 



(166) 



Similarly, we can rewrite the observation process part of the MARSS model by separating out the parts with 
no observation error: 



Vt 



n+Vt = n£r( z t*t + a * + H tVt ) 



(167) 



«+ (Z t I+* t + Ztig'zt + a t + H t v t ) 
MVN(0,Rt) 



I am treating A as fully stochastic for this example, but in general F might have rows. 

In order for this to be solvable using an EM algorithm with the Kalman filter, we require that no estimated 



B or u elements appear in the equation for y[°^ . Since the y\"' do not appear in the likelihood function 
(since h| ' = 0), would not affect the estimate for the parameters appearing in the y^ equation. This 
translates to the following constraints, (li X m <E> fl[°}z t I^q)'Dt,b is all zeros and fi^Ztl^D,, is all zeros. 
Also notice that Cl[°^Z t and fl[ }a. t appear in the equation and not in the y + equation. This means that 
ll| jz t and rij° r 'a t must be only fixed terms. 



(o) 
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In summary, the degenerate model becomes 





= b|°W 1 + u| 0) 


i 


= B+x 4 _i+u+ + G+w t 


w t 


~ MVN(0, Q t ) 


x 


~ MVN(£, A) 




= z(°)l+x t + Z( )l(°)x t + ai 0) 


vi 


= Z+x t + a+H+v t 




= Z+I+x t + Z+l(°) a:t + a+ + H+v t 


v* 


- MVN(0,R) 



(168) 



where B^ = 9 B t and B^ = flf q B t so that Bj are the rows of B t corresponding to the zero rows of G t 
and B/" are the rows of B t corresponding to non-zero rows of G t . The other parameters are similarly defined: 
u[ 0) = and u+ - n+ q u t , z[ 0) = ng>Z t and Z+ = n+.Z t) and af - flga* and a+ = n+ r a t . 

7.2 Identifying the fully deterministic x rows 

To derive EM update equations, we need to take the derivative of the expected log-likelihood holding ev- 
erything but the parameter of interest constant. If there are deterministic x t rows, then we cannot hold 
these constant and do this partial differentiation with respect to the state parameters. We need to identify 
these Xt rows and remove them from the likelihood function by rewriting them in terms of only the state 
parameters. For this derivation, I am going to make the simplifying assumption that the locations of the 
rows in Gt and H t arc time-invariant. This is not strictly necessary, but simplifies the algebra greatly. 

For the deterministic Xt rows, the process equation is Xt = B t a; t i + u t , with the w t term left off. When 
we do the partial differentiation step in deriving the EM update equation for u, B or £, we will need to take 
a partial derivative while holding x t and x t -i constant. We cannot hold the deterministic rows of x t and 
Xt~i constant while changing the corresponding rows of u t and B t (or £ if t = or t = 1). If a row of Xt is 
fully deterministic, then that x i]t must change when row i of u t or B t is changed. Thus we cannot do the 
partial differentiation step required in the EM update equation derivation. 

So we need to identify the fully deterministic Xt and treat them differently in our likelihood so we can 
derive the update equation. I will use the terms 'deterministic', 'indirectly stochastic' and 'directly stochastic' 
when referring to the x t rows. Deterministic means that that x t row (denoted xf ) has no state error terms 
appearing in it (no w terms) and can be written as a function of only the state parameters. Indirectly 
stochastic (denoted x\ s ) means that the corresponding row of G t is all zero (an xf" 1 row), but the Xt row 
has a state error term (w) which it picked up through B in one of the prior B t a; t steps. Directly stochastic 
(the xf) means that the corresponding row of G t is non-zero and thus these row pick up at state error term 
(■Wt) at each time step. The stochastic Xt are denoted xf whether they are indirectly or directly stochastic. 

How do you determine the d, or deterministic, set of Xt rows? These are the rows with no w terms, from 
time t or from prior t, in them at time t. Note that the location of the d rows is time-dependent, a row may 
be deterministic at time t but pick up a to at time t + 1 and thus be indirectly stochastic thereafter. I am 
requiring that once a row becomes indirectly stochastic, it remains that way; rows are not allowed to flip 
back and forth between deterministic (no w terms in them) and indirectly stochastic (containing a w term). 

I will work through an example and then show a general algorithm to keep track of the deterministic 
rows at time t. 

Let Xq = £ (so F is all zero and Xq is not stochastic). Define l\ s , and 1^ as diagonal indicator 
matrices with a 1 at i) if row i is directly stochastic, indirectly stochastic, or deterministic respectively. 
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I* + \\ s + if = I m . Let our state equation be X t = B t X t _i + G t w t . Let 



At t = 



At t = 1 



At t = 2 



r - 



By i = 3, the system stabilizes 
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(169) 



(170) 



(171) 



(172) 



(173) 



(174) 



(175) 



(176) 



(177) 



After time t = 3 the location of the deterministic and indirectly stochastic rows is stabilized and no longer 
changes. 

In general, it can take up to m time steps for the location of the deterministic rows to stabilize. This 
is because B t is like an adjacency matrix, and I require that the location of the 0s in B1B2 . . .B t is time 
invariant. If we replace all non-zero elements in B t with 1, then we have an adjacency matrix, let's call it 
M. If there is a path in M from x^ t to an x Sjt , then row j will eventually be indirectly stochastic. Graph 
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theory tells us that it takes at most m steps for a m x m adjacency matrix to show full connectivity. This 
means that if element j, i is in M m then row j is not connected to row i by any path and thus will remain 
unconnected for M t>m ; note element i, j can be while j, i is not. 

This means that B1B2 . . . B 4 , t > m, can be rearranged to look something like so where ds are directly 
stochastic, is are indirectly stochastic, and d are fully deterministic: 



ds 


ds 


ds 


ds 


ds 


ds 


ds 


ds 


ds 


ds 


is 


is 


is 


is 


is 
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d 











d 


d 



(178) 



The (is's, is's and <i's are not all equal nor are they necessarily all non-zero; I am just showing the blocks. 
The d rows will always be deterministic while the is rows will only be deterministic for t < m time steps; 
the number of time steps depends on the form of B. 

Since my B t matrices arc small, I use a inefficient strategy in my code to construct the indicator matrices 
1^. I define M as B t with the non-zero B replaced with 1; I require that the location of the non-zero 
elements in Bt are time- invariant so there is only one M. Within the product M*, those rows where only 
Os appear in the 'stochastic' columns (non-zero Gt rows) are the fully deterministic Xt+i rows. Note, t + 1 
so one time step ahead. There are much faster algorithms for finding paths, but my M tend to be small. 
Also, unfortunately, using B1B2 . . . B t , needed for the xf function, in place of M* is not robust. Let's say 
B = [rj 1 Tj 1 ] and G = [J]. Then B 2 is a matrix of all zeros even though the correct 1% is [§§] not [§ ?]• 



7.2.1 Redefining the xf elements in the likelihood 

By definition, all the B t elements in the ds and is columns of the d rows of B t are (see equation [T7S]). This 
is due to the constraint that I have imposed that locations of Os in B t are time-invariant and the location of 
the zero rows in G t also time-invariant: and I^ -* are time-constant. 

Consider this B and G, which would arise in a MARSS version of an AR-3 model: 
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Using Xq = £: 
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The . . . just represent 'some values'. The key part is the w appearing which is the stochasticity. At t = 1, 
rows 2 and 3 are deterministic. At t = 2, row 3 is deterministic, and at t — 3, no rows are deterministic. 
We can rewrite the equation for the deterministic rows in Xt as follows. Note that by definition, all the 
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non-d columns in the d-rows of B t are zero, xf is x t with the d rows zeroed out, so xf = I^jXt. 
xf = B?x + uf 

= I?(Bi* + f«,i + 
xJj = B^xi + 

= B^BxXo + f„,i + D u>1 t;)) + f£ 2 + D* 2 t; (181) 
= I^B 2 I^(lf(B 1 x + f„,i + D Ui i«)) + lgf u>2 + I^D„ i2 u 
= I^BaBxXo + lg(B 2 fi, u + f 2 , u ) + I 2 (B 2 D Uil + D u , 2 )t; 
= I^(B 2 Bi* + B 2 fi,„ + f 2 ,„ + (B 2 D Uil + D tt , 2 )u) 



The messy part is keeping track of which rows are deterministic because this will potentially change up to 
time t = m. 

We can rewrite the function for xf, where to is the t at which the initial state is defined. It is either t — 
or t = 1. 

xf = I*(B* t x t0 +r t +-D* t v) 
where 
B t * = I m 

it = B t f t *_! + f tiU (182) 

D t * =B t D t *_ 1 +D t , u 
T d — T 

diag(lf o+T ) - apply(n(°)M T n+ == 0, 1, all) 

The bottom line is written in R: if +T is a diagonal matrix with a 1 at where row i of G is all and 
all ds and is columns in row i of M* are equal to zero. 

In the expected log-likelihood, the term E[Xf] = E[.X"f |y = y], meaning the expected value of Xf 
conditioned on the data, appears. Thus in the expected log-likelihood the function will be written: 

xt = it(B* t x t0 + r t +n* t v) 

E[Xf]=lf(B* t E[X to }+r t +T>* t v) 

When the j-th row of F is all zero, meaning the j'-th row of Xq is fixed to be £j, then E[X^ 0i j] = £j. This 
is the case where we treat x to j as fixed and we either estimate or specify its value. If x to is wholly treated 
as fixed, then E[X tf) ] = £ and A does not appear in the model at all. In the general case, where some x to j 
are treated as fixed and some as stochastic, we can write E[Xf ] appearing in the expected log-likelihood as: 

E[X t0 ] = (I m -li 0) )E[X to ]+li 0) £ (184) 
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1^ is a diagonal indicator matrix with 1 at if row j of F is all zero. 

If B d ' d and u d are time-constant, we could use the matrix geometric series: 

t-i 

xf =(B d ' d yx$ + V (B d > d yu d = (B d ' d y X d + (I - B d,d ) _1 (I - {B d ' d Y)u d , if B d - d ^ I 

*=o ( 185 ) 
+ u d , if B d ' d = I 

where B d ' d is the block of gTs in equation 11781 

7.2.2 Dealing with the x\ s elements in the likelihood and associated parameter rows 

Although w'f = 0, these terms are connected to the stochastic x's in earlier time steps though B, thus all 
x\ s are possible for a given u t , B t or However, all x\ s are not possible conditioned on Xt-ij so we are back 
in the position that we cannot both change Xt and change u*. 

Recall that for the partial differentiation step in the EM algorithm, we need to be able to hold the -E'l-X'*] 
appearing in the likelihood constant. We can deal with the deterministic Xt because they are not stochastic 
and do not have 'expected values'. They can be removed from the likelihood by rewriting xf in terms of the 
model parameters. We cannot do that for x\ s because these x are stochastic. There is no equation for them; 
all x ls are possible but some are more likely than others. We also cannot replace x\ s with B\ a E[X t ~i] + uj s 
to force B l t s and u ls to appear in the y part of the likelihood. The reason is that E[X t ] and E[-X" t _i] both 
appear in the likelihood and we cannot hold both constant (as we must for the partial differentiation) and 
at the same time change B l t s or u l t s as we are doing when we differentiate with respect to B l t s or u)J s . We 
cannot do that because x\ s is constrained to equal B\ s x t -i + u\ s . 

This effectively means that we cannot estimate B] s and uj s because we cannot rewrite x\ s in terms of 
only the model parameters. This is specific to the EM algorithm because it is an iterative algorithm where 
the expected X t are computed with fixed parameters and then the E[X t ] are held fixed at their expected 

values while the parameters are updated. In my B update equation, I assume that b| ^ is fixed for all t. 
Thus I circumvent the problem altogether for B. For u, I assume that only the u IS elements are fixed. 

7.3 Expected log-likelihood for degenerate models 

The basic idea is to replace I d Fi[X t ] with a deterministic function involving only the state parameters (and 
E[X to ] if X to is stochastic) . These appear in the y part of the likelihood in Z t X t when the d columns of 
Z t have non-zero values. They appear in the x part of the likelihood in B t X t ~i when the d columns of B t 
have non-zero values. They do not appear in X t in the x part of the likelihood because Qt has all the non-s 
columns and rows zeroed out (non-s includes both d and is) and the element to the left of Qt is a row vector 
and to the right, it is a column vector. Thus any xf in X t are being zeroed out by Q t . 
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The first step is to pull out the lfX t : 



1 T 

= E[logL(F + ,X + ; e)] = E[--]T 



(Y t - Z t (I m - l d t )X t - Z t lfX t - a t y 
(Y t - Z t (I m - I t d )X t - Z t I d X t - at) - \ 53 log |R t 



2 
l 

t _ (186) 

to+1 



1 T 

(X t - B t ((I m - lti)X t -i + ltiXt-i) - ut) - - log IQ 



to+1 

- \{Xt a - Z) T HXt log |A| - \ lo g 27r 

See section 17.21 for the definition of 1^ . 

Next we replace I d X t with equation (|182[) . X t() will appear in this function instead of Xt - I rewrite Ut 
as f u t + ~D u t v. This gives us the expected log-likelihood: 

T 

i 



+ = E[logL(F+,X+;e)]= E[-i^ 

l 

(K t - Z t (I m - l d t )X t - ZJ? (B*X t0 + f* + U*v) - a t ) T M t 

1 T 

(F t - Z t (I m - I?)X t - Z t I t d (B t *Jf to + f* + D*«) - a t ) - - V log |R t 



2 ^ 

T 1 (187) 

" \ E ( X * " B *(Pm - ^-i^t-i + ^-iO^-i^o + f t -x + d;_ iW )) - f M , 4 - D„, tU ) T Q f 

to+1 

(JT* - B t ((I m - I?_i)*t-i + It_i(B*_iX t0 + f*_ x + DJ_! +«)) - f ttli - D^t;) 

T 

- - ^ log |Q t | - \ (X t0 - £) T L(X t0 - €) - 1 log | A| - ^ log 2vr 

to 

where B*, f* and D* are defined in equation ([T52"|) . Mi = EjR.- 1 E t and Q f = <S>jQ~ 1 $ t , L = n T A _1 II. 
When Xt is treated as fixed, L — and the last line will drop out altogether, however in general some rows 
of x to could be fixed and others stochastic. 

We can see directly in equation (|187l) where v appears in the expected log-likelihood. Where p appears 
is less obvious because it depends on F, which specifies which rows of Xt are fixed. From equation (|184[) . 

E[X t0 ] = (I m -I^)E[X t0 ]+ll°k 

and £ = f{ +D^p. Thus where p appears in the expected log-likelihood depends on the location of zero rows 
in F (and thus the zero rows in the indicator matrix if ). Recall that E[X to ] appearing in the expected 
log- likelihood function is conditioned on the data so E[-X" to ] in ^ is not equal to £ if Xt is stochastic. 

The case where x to is stochastic is a little odd because conditioned on X to — x to , xf is deterministic even 
though Xq is a random variable in the model. Thus in the model, xf is a random variable through X to . But 
when wc do the partial differentiation step for the EM algorithm, we hold X at its expected value thus we 
are holding X to at a specific value. We cannot do that and change u at the same time because once we fix 
X to the xf are deterministic functions of u. 
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7.4 Logical constraints to ensure a consistent system of equations 

We need to ensure that the model remains internally consistent when R or Q goes to zero and that we do 
not have an over- or under-constrained system. 

As an example of a solvable versus unsolvable model, consider the following. 



H/R/ 



then following are bad versus ok Z matrices. 



"0 


0" 






"0 








0" 


1 







a 0" 







a 











1 




b 










b 
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Z 



bad 



c 


d 





3(2,1) 


z(2,2) 


2(2,3) 


2(3,1) 


2(3,1) 


2(3,1) 


c 


d 






Zok 



c 

z(2,l) 2(2,2) 2(2,3) 

2(3,1) 2(3,1) 2(3,1) 

c di= 



(189) 



Because and j/t(4) have zero observation variance, the first Z reduces to this for x t (l) and x t (2): 







cx t {\) - 


hdx t (2) 


y t (4)_ 




cx t (l) - 


Vdx t {2)_ 



and since y*(l) ^ 2/t(4), potentially, that is not solvable. The second Z reduces to 



Vt(i)" 




CXt(l) 


.tft(4). 




cx t (l) + dx t (4) 



(190) 



(191) 



and that is solvable for any ?/t(l) and yt(4) combination. Notice that in the latter case, Xt(l) and Xt(2) are 
fully specified by y t (l) and j/ t (4). 



7.4.1 Constraint 1: Z does not lead to an over-determined observation process 

We need to ensure that a x t exists for all y[ such that: 

E[yi 0) ] =Z(°»E[i f ]|a(»). 

If Z' -* is invertible, such a Xt certainly exists. But we do not require that only one Xt exists, simply that at 
least one exists. Thus the system can be under-constrained but not over-constrained. One way to test for this 
is to use the singular value decomposition (SVD) of Z^°\ If the number of singular values of Z^ is less than 
the number of columns in Z, which is the number of x rows, then Z^ specifies an over-constrained system 
(y = Using the R language, you would test if the length of svd(Z)$d is less than than dim(Z) [2] . If 

Z*- -* specifies and under-determined system, some of the singular values would be equal to (within machine 
tolerance). It is possible that could specify both an over- and under-determined system at the same 
time. That is, the number of singular values could be less than the number of columns in Z^ and some of 
the singular values could be 0. 

Doesn't a Z with more rows than columns automatically specify a over-determined system? No. Con- 
sidered this Z 

"1 0" 

1 (192) 


This Z is fine, although obviously the last row of y will not hold any information about the x. But it could 
have information about R and a, which might be shared with the other y, so we don't want to prevent the 
user from specifying a Z like this. 



7 This is the classic problem of solving the system of linear equations, which is standardly written Ax - 
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7.4.2 Constraint 2: the state processes are not over-constrained. 

We also need to be concerned with the state process being over-constrained when both Q = and R = 
because we can have a situation where the constraint imposed by the observation process is at odds with 
the constraint imposed by the state process. Here is an example: 



Vt = 



1 


0" 









1 




x% 



Xl 




1 


0" 




Xl 




Wl 


X2_ 


t 










X2_ 


+ 

t-1 






In this case, some of the x's are deterministic, Q = and not linked through B to a stochastic x, and 
the corresponding y are also deterministic. These cases will show up as errors in the Kalman filter/smoother 
because in the Kalman gain equation (equation I139c|) . the term Z t V t t ~ 1 Zj will appear when R = 0. We 
need to make sure that rows in B t , Z t and Q t do not line up in such a way that rows/cols do not appear 
in Z t Vj _1 Z t r at the same place as rows/cols in R. In MARSS, this is checked by doing a pre-run of the 
Kalman smoother to see if it throws an error in the Kalman gain step. 



8 EM algorithm modifications for degenerate models 

The R, Q, Z, and a update equations are largely unchanged. The real difficulties arise for the u and £ 
update equations when u( ) or £^ are estimated. For B, I do not have a degenerate update equation, so I 
need to assume that B^ elements are fixed (not estimated). 

8.1 R and Q update equations 

The constrained update equations for Q and R work fine because their update equations do not involve any 
inverses of non-invertible matrices. However if HtRtH^ is non-diagonal and there are missing values, then 
the R update equation involves y t . That will involve the inverse of HjRnH^ (section l6.2p . which might 
have zeros on the diagonal. In that case, use the Vt modification that deals with such zeros (equation I146p . 

8.2 Z and a update equations 

We need to deal with Z and a elements that appear in rows where the diagonal of R = 0. These values will 
not appear in the likelihood function unless they also happen to also appear on the rows where the diagonal 
of R is not (because they are constrained to be equal for example). However, in this case the Z (0) and a<°) 
are logically constrained by the equation 

y^ = Z^E[ Xt )+4 0) . 

Notice there is no w t since R = for these rows. The E[a; t ] is ML estimate of x t computed in the Kalman 
smoother from the parameter values at iteration i of the EM algorithm, so there is no information in this 
equation for Z and a at iteration i + 1. The nature of the smoother is that it will find the Xt that is most 
consistent with the data. For example if our y = Zx + a equation looks like so 



"0" 




T 


2 




I 



(194) 



there is no x that will solve this. However x = 1 is the closest (lowest squared error) and so this is the 
information in the data about x. The Kalman filter will use this and the relative value of Q and R to come 
up with the estimated x. In this case, R = 0, so the information in the data will completely determine x 
and the smoother would return x = 1 regardless of the process equation. 
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The a and Z update equations require that D f aMt^t.a and D ( z R(Dt )Z are invertible. If ZJ ; 

and are fixed, this will be satisfied, however the restriction is a little less restrictive than that since it 
is possible that R t does not have zeros on the diagonal in the same places so that the sum over t could be 
invertible while the individual values at t are not. The section on the summary of constraints has the test 
for this constraint. 

The update equations also involve y t , and the modified algorithm for y f when H t has all zero rows will 
be needed. Other than that, the constrained update equations work (sections 15.21 and I5TT)) . 



8.3 u update equation 

Here I discuss the update for u, or more specifically v which appears in u, when G t or H t have zero rows. I 
require that vl\ s is not estimated. All the \i\ s are fixed values. The may be estimated or more specifically 
there may be v in that are estimated; uf = f u t + t v. 

For the constrained u update equation with deterministic re's takes the following form. It is similar to 
the unconstrained update equation except that that a part from the y part of the likelihood now appears: 

= ( 2(Aj 2 R t A tl 2 + Aj A Q t A tA )) xfj2 (A t y 2 R t A M + A^ 4 Q { A t , 3 )) (195) 



Concep t ually. I think the approach described here is the similar to the approach presented in section 4.2.5 
of ( Harvevl Il989l ) , but it is more general because it deals with the case where some u elements are shared 



(linear functions of some set of shared values), possibly across deterministic and stochastic elements. Also, 
I present it here within the context of the EM algorithm, so solving for the maximum-likelihood u appears 
in the context of maximizing with respect to u for the update equation at iteration j + 1. 



8.3.1 u(°) is not estimated 

When u^ ) is not estimated (since it is at some user defined value via D u and f„), the part we are estimating, 
u + , only appears in the x part of the likelihood. The update equation for u remains equation (f95|) . 



8.3.2 u d is estimated 

The derivation of the update equation proceeds as usual. We need to take the partial derivative of X P + 
(equation I187[) holding everything constant except v, elements of which might appear in both and uf 
(but not uj s since I require that u l t s has no estimated elements). 

The expected log-likelihood takes the following form, where to is the time where the initial state is defined 
(t = or t = 1): 

T T 

*+ = -~ E(A M - A ti2 t,) T R t (A M - A ti2 v) lo S l R *l 

1 1 

T T 

-- (A t , 3 - A M z,) T Q t (A t , 3 - A tA v) - - E log |Q t | ( 196 ^ 

to+l t +l 

1 In 
--(X t0 - T L(X to - I) - 2 l0 § l A l - 2 l0g2?r 

L = F T A _1 F. If Xt is treated as fixed, F is all zero and the line with L drops out. If some but not all Xt 
are treated as fixed, then only the stochastic rows appear in the last line. In any case, the last line does not 
contain v, thus when we do the partial differentiation with respect to v, this line drops out. 



51 



The A terms are defined as: 



A M = y t - Z t (I m - I*)x t - Z t I?(B t * E[X t0 ] + f*) - a 4 
A t , a = Z 4 I t d D* 

^to,3 = mx i 

A t , 3 = xt - B f (I m - Itjxt-i - B t lti(B t *_! E[X t0 ] + fj_ x ) - f M (197) 

At 0i 4 = O mX mDl,n 

A M =D t , u + B t ltiD*_ 1 
Epr to ] = ((I m -lf)x t0 +lf« 

if, Bj , ft*, and Dj are defined in equation (|182p . The values of these at to is special so that the math works 
out. The expectation ( E) has been subsumed into the As since A2 and A4 do not involve X or Y , so terms 
like X T X never appear. 

Take the derivative of this with respect to v and arrive at: 

T T , T T s 

v j+1 = ( ]T A^ 4 Q t A M + Aj^RtA^y 1 x ( £ A^ 4 Q t A 1)3 + £ A^ a R t A M ) J (198) 
t=i t=i ^ t=i t=i ' 

8.4 £ update equation 

8.4.1 £ is stochastic 

This means that none of the rows of F (in FA) are zero, so is all zero and the update equation reduces 
to a constrained version of the classic £ update equation: 

p i+1 - (■D]A- 1 -Ds)- 1 -DjA-\E[X t0 ] - f c ) (199) 

8.4.2 £ (0) is not estimated 

When £^ is not estimated (because you fixed it as some value), we do not need to take the partial derivative 
with respect to £^ since we will not be estimating it. Thus the update equation is unchanged from the 
constrained update equation. 

8.4.3 £ (0) is estimated 

Using the same approach as for u update equation, we take the derivative of (|187l) with respect to p where 
£ = f{ + DjP- ^ + will take the following form: 

= 

T T 

- ^(A t , 5 - A t , 6 p) T K t (A t , 5 - A t , 6 p) - ^ log |Rt| 

t=i 1 

T T 

- -J2(A ti7 - A t , 8P ) T Q t (A t , 7 - A t , 8 p) - ~ 5>g |Q t | (200) 

t=i 1 

- \(E[X t0 ] - f 6 - Dep) T L(E[X t0 ] - f 5 - D CP ) - i tog |A| 

n 
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The A's are defined as follows using E[Xt ] = (I m — I; )x to + 1;°^ where it appears in if E[Xt]. 
A t)8 =y t - Z t (I ro - I t d )X t - Z 4 I t d (B*((I m - if )z to + ll 0) f e ) + u* t ) - a t 
A tl6 = Z t lfB*li 0) D c 

At ,7 = mx i 

A tl7 = X t - B t (I m - lti)St-i - BJt^B^^a™ - I, (0) )£t + ifff) + uj_0 - u t (201) 

At 0i 8 = OmxmDf 

A t)8 = B^B^lf D 4 
u* = f * + B* t v 

The expectation can be pulled inside the As since the As in front of p do not involve X or Y. 
Take the derivative of this with respect to p and arrive at: 

T T 

*Tt, n r 1 . 



p i+1 = ( A ls®t&t,8 + A le R tA t s + njhr> $ ) 

X 'J x ( 202 ) 



8.4.4 When H 4 has rows in addition to G t 

When Ut has all zero rows, some of the p or v may constrained by the model, but these constraints do not 
appear in ^ + since M. t zeros out those constraints. For example, if H t is all zeros and x\ = £, then £ is 
constrained to equal Z _1 (y 1 — ai). 

The model needs to be internally consistent and we need to be able to estimate all the p and the v. Rather 
than try to estimate the correct p and v to ensure internal consistency of the model with the data when 
some of the H t have rows, I test by running the Kalman filter with the degenerate variance modification 
(in particular the modification for F with zero rows is critical) before starting the EM algorithm. Then I 
test that y t — Z t x t — a t is all zeros. If it is not, within machine accuracy, then there is a problem. This is 
reported and the algorithm stopped^ 

I also test that (X^tli A^ 8 QtA tj8 + J2t=i A/"e^t^t,6 + DjLD^) is invertible to ensure that all the p 
can be solved for, and I test that ( 53t=i A^Qt At. 4 + Y^t=i A^^t At, 2) is invertible so that all the v can be 
solved for. If errors are present, they should be apparent in iteration 1, are reported and the EM algorithm 
stopped. 

8.5 update equation for degenerate models 

I do not have an update equation for B^ -* and for now, I side-step this problem by requiring that any B*- -* 
terms are fixed. 



9 Kalman filter and smoother modifications for degenerate mod- 
els 

9.1 Modifications due to degenerate R and Q 

[1/1/2012 note. These modifications mainly have to do with inverses that appear in the Shumway and 
Stoffer's presentation of the Kalman filter. Later I want to switch to Koopman's smoother algorithm which 

18 In some cases, it is easy to determine the correct £. For example, when Ht is all zero rows, to = 1 and there is no missing 
data at time t = 1, £ = Z*(j/ 1 — ai), where Z* is the pseudoinverse. One would want to use the SVD pseudoinverse calculation 
in case Z leads to an under-constrained system (some of the singular values of Z are 0). 
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avoids these inverses altogether.] 

In principle, when either GtQ t or H(R t has zero rows, the standard Kalman filter/smoother equations 
would still work and provide the correct state outputs and likelihood. In practice however errors will 
be generated because under certain situations, one of the matrix inverses in the Kalman filter/smoother 
equations will involve a matrix with a zero on the diagonal and this will lead to the computer code throwing 
an error. 

When H t R t has zero rows, problems arise in the Kalman update part of the Kalman filter. The Kalman 
gain is 

K t =Vr 1 (Z t *) T (Z t *Vr 1 (ZD T + H i R t *H t T )- 1 (203) 

Here, Z£ is the missing values modified Z t matrix with the i-th rows zero-ed out if the i-th element of y t 
is missing (section 16.11 equation I141[) . Thus if the i-th element of y t is missing and the i-th row of H t is 
zero, the element of (ZjV* _1 (Z( ) T + HfR^H^) will be zero also and one cannot take its inverse. In 
addition, if the initial value X\ is treated as fixed but unknown then will be a m x m matrix of zeros. 
Again in this situation (Z*V* _1 (Z*) T + H t R*H t T ) will have zeros at any elements where the i-th row 
of H f is also zero. 

The first case, where zeros on the diagonal arise due to missing values in the data, can be solved using 
the matrix which pulls out the rows and columns corresponding to the non-missing values (flji ). Replace 
(Z*V* _1 (Z*) T +U t R^Hjy 1 in equation ((203j) with 

(^ 1) ) T (^ 1) (Z:V*- 1 (Z:) T +H 4 R:H7)(^ 1) ) T )^ 1 ^ 1) (204) 

Wrapping in f2j 1 - ) (f2j 1 ' > ) T gets rid of all the zero rows/columns in ZjVj -1 (Zj) t + H t H' t Hj , and the matrix 
is reassembled with the zero rows/columns reinserted by wrapping in (S~2^ 1 ' ) ) T Oj 1 ' ) . This works because R' t 
is the missing values modified R (section 11.31) and is block diagonal across the i and non-i rows/columns, 
and Z' t has the i-columns zero-ed out. Thus removing the i columns and rows before taking the inverse has 
no effect on the product 2i t {...)~ 1 . When V" = 0, set Ki = without computing the inverse (see equation 
12031 where appears on the left). 

There is also a numerical issue to deal with. When the i-th row of H t is zero, some of the elements of 
Xt may be completely specified (fully known) given y t . Let's call these fully known elements of Xt, the fc-th 
elements. In this case, the fc-th row and column of V f must be zero because given yt(i), Xt(k) is known 
(is fixed) and its variance, V t (fc,fc), is zero. Because K t is computed using a numerical estimate of the 
inverse, the standard Vj update equation (which uses K t ) will cause these elements to be close to zero but 
not precisely zero, and they may even be slightly negative on the diagonal. This will cause serious problems 
when the Kalman filter output is passed on to the EM algorithm. Thus after is computed using the 
normal Kalman update equation, we will want to explicitly zero out the k rows and columns in the filter. 

When Gt has zero rows, then we might also have similar numerical errors in J in the Kalman smoother. 
The J equation is 



j^v-jBavrr 1 

where V*" 1 = BtV^Bj + G t Q t Gj 



(205) 



If there are zeros on the diagonals of (A and/or B t ) and zero rows in Gt and these zeros line up, then if 
the b|°' > and BjP elements in B t are blockj^l. there will be zeros on the diagonal of Vj. Thus there will 
be zeros on the diagonal of V£ _1 and it cannot be inverted. In this case, the corresponding elements of 
need to be zero since what's happening is that those elements are deterministic and thus have variance. 

We want to catch these zero variances in V* -1 so that we can take the inverse. Note that this can only 
happen when there are zeros on the diagonal of G t Q t Gj since BfVjljB^ can never be negative on the 



19 This means the following. Let the rows where the diagonal elements in Q equal zero be denoted i and the the rows where 
there are non-zero diagonals be denoted j. The b' ' elements are the Bt elements where both row and column are in i. The 
Bj 1 ^ elements are the B elements where both row and column are in j. If the Bj ' and Bj 1 ' elements in B are blocks, this 
means all the Bt(i, j) are 0; no deterministic components interact with the stochastic components. 
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diagonal since B t B7 must be positive-definite and so is V*_J. The basic idea is the same as above. We 
replace (V^ 1 )" 1 with: 

(n+ t ) T (o+ (v;- 1 )(n+) T )" 1 n+ t (206) 

where Oj f is a matrix that removes all the positive V^ -1 rows analogous to fl^. 
9.2 Modifications due to fixed initial states 

When the initial state of x is fixed, then it is a bit like A = although actually A does not appear in the 
model and £ has a different interpretation. 

When the initial state of x is treated as stochastic, then if to = 0, £ is the expected value of Xo conditioned 
on no data. In the Kalman filter this means x® = £ and Vq = A; in words, the expected value of Xq 
conditioned on y is £ and the variance of xjj conditioned on y Q is A. When to = 1, then £ is the expected 
value of Xi conditioned on no data. In the Kalman filter this means x^ = $ and V° = A. Thus where £ and 
A appear in the Kalman filter equations is different depending on to ; the x\ and N\ initial condition versus 
the x\~ l and V^ _1 initial condition. 

When some or all of the x to are fixed, denoted the I^x to , the fixed values are not a random variables. 
While technically speaking, the expected value of a fixed value does not exist, we can think of it as a random 
variable with a probability density function with all the weight on the fixed value. Thus I^E[x to ] = £ 
regardless of the data. The data have no information for I^Xt since we fix I^Xt at If to = 0, we 

initialize the Kalman filter as usual with x$ = £ and Vq = FAF T , where the fixed x to rows correspond to 
the zero row/columns in FAF T . The Kalman filter will return the correct expectations even when some of 
the diagonals of HRH T or GQG T are — with the constraint that we have no purely deterministic elements 
in the model (meaning there are no errors terms from cither R or Q). 

When t = 1, I^x? and l[ 0) x} = £ regardless of the data and V? = FAF T and Vj = FAF T , where the 
fixed rows of X\ correspond with the row/columns in FAF T . We also set I^Ki, meaning the rows of X\ 
that are fixed, to all zero because Ki is the information in y 1 regarding X\ and there is no information in 
the data regarding the values of X\ that are fixed to equal I^ ^- 

With ~V\, x\ and Ki set to their correct initial values, the normal Kalman filter equations will work fine. 
However it is possible for the data at t = 1 to be inconsistent with the model if the rows of y 1 corresponding 
to any zero row/columns in ZiFAF T Z ] r + HiRiH^ are not equal to Zi£ + ai. Here is a trivial example, 
let the model be x t = x t -i + w t , yt = x t , Xi = 1. Then if y\ is anything except 1, the model is impossible. 
Technically, the likelihood of x\ conditioned on Y± = y\ does not exist since neither x\ nor y\ are realizations 
of a random variable (since they are fixed), so when the likelihood is computed using the innovations form 
of the likelihood, the t = 1 does not appear, at least for those y l corresponding to any zero row/columns 
in ZiFAF T Z ] r + HiRiH^- Thus these internal inconsistencies would neither provoke an error nor cause 
Inf to be returned for the likelihood. In the MARSS package, the Kalman filter has been modified to return 
LL=Inf and an error. 

10 Summary of requirements for degenerate models 

Below are discussed the update equations for the different parameters. Here I summarize the constraints that 
are scattered throughout these subsections. These requirements are coded into the function MARSSkem- 
checkQ in the MARSS package but some tests must be repeated in the function degen.test(), which tests if 
any of the R or Q diagonals can be set to zero if it appears they are going to zero. A model that is allowed 
when R and Q are non-zero, might be disallowed if R or Q diagonals were to be set to zero. degen.testQ 
does this check. 

• (I m ®l£°)ZtI^)Dt ,b, is all zeros; is all zeros. If there is a all zero row in H t and it is linked (through Z) to 
a all zero row in Gt, then the corresponding B t elements are fixed instead of estimated. Corresponding 



B rows means those rows in B where there is a non-zero column in Z. We need I}° } Z t I^B t to only 

specify fixed B t elements, which means vec(I^,°- ) Z t I^B t I„ l ) only specifies fixed values. This in turn 
leads to the condition above. MARSSkcmcheckQ 

• (Ii ® l[°^Z t I^)D t;M is all zeros; if there is a all zero row in H t and it is linked (through Z t ) to a all 
zero row in G t , then the corresponding u t elements are fixed instead of estimated. MARSSkcmcheck() 

• (I TO ® I^°')D tj 2, where is all zeros; if y has no observation error, then the corresponding Z t rows are 
fixed values. (I m ® I^) is a diagonal matrix with Is for the rows of T> t ,z that correspond to elements 
of Z t on the R = rows. MARSSkemcheck() 

• (Ii(g>l£°))D t;0 is all zeros; if y has no observation error, then the corresponding a t rows are fixed values. 
MARSSkemcheck() 

• (I TO I^)Dt ; (, is all zeros. This means B*- -* (the whole row) is fixed. While B d could potentially be 
estimated potentially, my derivation assumes it is not. MARSSkcmcheckQ 

• (Ii®I" t>m )Dt )tt is all zeros. This means u is is fixed. Here is is defined as those rows that are indirectly 
stochastic at time to, where m is the dimension of B; it can take up to m steps for the is rows to be 
connected to the s rows through B. MARSSkcmchcck() 

• If u' ) or £^ are being estimated, then the adjacency matrices defined by B t ^ are not time- 
varying. This means that the locations of the Os in B t are not changing over time. B t however may 
be time-varying. MARSSkemcheck() 

• 1^ and I^ -* are time invariant (an imposed assumption). This means that the location of the rows 
in G t and H t (and thus in w t and v t ) are not changing through time. It would be easy enough to 
allow I^ ** to be time varying, but to make my derivation easier, I assume it is time constant. 

• zf^ in E[y^] = zj ^ E[X t ] + a|°^ does not imply an over-determined system of equations. Be- 
cause the Vt rows are zero for the (0) rows of y, it must be possible for this equality to hold. This 
means that z[ cannot specify an over-determined system although an underdetermined system is ok. 
MARSSkemcheckQ checks by examining the singlular values of z[ returned from the singlular value 
decomposition (svd). The number of singlular values must not be less than m (columns of Z). If it 
is less than m, it means the equation system is over-determined. Singular values equal to are ok; it 
means the system is under-determined given only the observation equation, but that's ok because we 
also have the state equation will determine the under states and the Kalman smoother will presumably 
throw an error if the state process is under-determined (if that would even make sense...). 

• The state process cannot be over-determined via constraints imposed from the deterministic observation 
process (R = 0) and the deterministic state process (Q = 0). If this is the case the Kalman gain 
equation (in the Kalman filter) will throw an error. Checked in MARSSQ via call to MARSSkf() 
before fitting call; degen.testQ, in MARSSkem() will also test via MARSSkf call if some R or Q are 
attempted to be set to 0. If B or Z changes during kem or optim iterations such that this constraint 
does not hold, then algorithm will exit with an error message. 

• The location of the 0s in B are time- invariant. The B can be time-varying but not the location of 0s. 
Also, I want B to be such that once a row becomes indirectly stochastic is stays that way. For example, 
if B = [ i J] , then row 2 flips back and forth from being indirectly stochastic to deterministic. 

The dimension of the identity matrices in the above constraints is given by the subscript on I except when 
it is implicit. 
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11 Implementation comments 



The EM algorithm is a hill-climbing algorithm and like all hill-climbing algorithms it can get stuck on local 
maxima. There are a number approaches to doing a pre- search of the ini t ial co nditions space, but a brute 



force random Monte Carol search appears to work well (jBiernacki et all [2003). It is slow, but normally 
sufficient. In my experience, Monte Carlo initial conditions searches become important as the fraction of 
missing data in the data set increases. Certainly an initial conditions search should be done before reporting 
final estimates for an analysis. However in oui0 studies on the distributional properties of parameter 
estimates, we rarely found it necessary to do an initial conditions search. 

The EM algorithm will quickly home in on parameter estimates that are close to the maximum, but once 
the values are close, the EM algorithm can slow to a crawl. Some researchers start with an EM algorithm 
to get close to the maximum-likelihood parameters and then switch to a quasi-Newton method for the final 
search. In many ecological applications, parameter estimates that differ by less than 3 decimal places are for 
all practical purposes the same. Thus we have not used the quasi-Newton final search. 

Shumway and Stoffer (2006; chapter 6) imply in their discussion of the EM algorithm that both £ and A 
can be estimated, though not simultaneously. Harvey (1989), in contrast, discusses that there are only two 
allowable cases for the initial conditions: 1) fixed but unknown and 2) a initial condition set as a prior. In 
case 1, £ is Xq (or X\) and is then estimated as a parameter; A is held fixed at 0. In case 2, | and A specify 
the mean and variance of Xq (or X\) respectively. Neither are estimated; instead, they are specified as part 
of the model. 

As mentioned in the introduction, misspecification of the prior on Xq can have catastrophic and unde- 
tectable effects on your parameter estimates. For many MARSS models, you will never see this problem. 
However, if you are fitting models that imply a correlation structure between the hidden states (i.e. the 
variance-covariance matrix of the X's is not diagonal), then your prior can definitely create problems if it 
does not have the same correlation structure as that implied by your MLE model. A common default is 
to use a prior with a diagonal variance-covariance matrix. This can lead to serious problems if the im- 
plied variance-covariance of the X'a is not diagonal. A diffuse prior does not get around this since it has a 
correlation structure also even if it has infinite variance. 

One way you can detect that you have a problem is to start the EM algorithm at the outputs from a 
Newton-esque algorithm. If the EM estimates diverge and the likelihood drops, you have a problem. Here 
are a few suggestions for getting around the problem: 

• Treat x as an estimated parameter and set V o =0. If the model is not stable going backwards in time, 
then treat Xi as the estimated parameter; this will allow the data to constrain the Xi estimate (since 
there is no data at t = 0, Xq has no data to constrain it). 

• Try a diffuse prior, but first read the info in the KFAS R package about diffuse priors since MARSS 
uses the KFAS implementation. In particular, note that you will still be imposing an information 
on the correlation structure using a diffuse prior; whatever Vo you use is telling the algorithm what 
correlation structure to use. If there is a mismatch between the correlation structure in the prior 
and the correlation structure implied by the MLE model, you will not be escaping the prior problem. 
But sometimes you will know your implied correlation structure. For example, you may know that 
the x's are independent or you may be able to solve for the stationary distribution a priori if your 
stationary distribution is not a function of the parameters you are trying to estimate. Other times you 
are estimating a parameter that determines the correlation structure (like B) and you will not know a 
priori what the correlation structure is. 

In some cases, the update equation for one parameter needs other parameters. Technically, the Kalman 



filter / smoother should be run between each parameter update, however following iGhahramani and Hinton 



( 1996f ) the default MARSS algorithm skips this step (unless the user sets control$saf e=TRUE) and each 



updated parameter is used for subsequent update equations. If you see warnings that the log-likelihood 
drops, then try setting control$saf e=TRUE. This will increase computation time greatly. 



"Our" and "we" in this section means work and papers by E. E. Holmes and E.J. Ward. 
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12 MARSS R package 



R code for the Kalman filter, Kalman smoother, and EM algorithm is provided as a separate R package, 
MARSS, available on CRAN ( [http://cran.r-project.org/web/packages7M ARSS). MARSS was developed by 
Elizabeth Holmes, Eric Ward and Kellie Wills and provides maximum-likelihood estimation and model- 
selection for both unconstrained and constrained MARSS models. The package contains a detailed user 
guide which shows various applications. In addition to model fitting via the EM algorithm, the package 
provides algorithms for bootstrapping, confidence intervals, auxiliary residuals, and model selection criteria. 
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